# Intro to Analytics

## Introduction to Scikit-Learn

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

#make the plots show up inline
%matplotlib inline 

### Scikit-Learn's Estimator API

Most commonly, the steps in using the Scikit-Learn estimator API are as follows (we will step through a handful of detailed examples in the sections that follow).

1. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.

2. Choose model hyperparameters by instantiating this class with desired values.

3. Arrange data into a features matrix of shape (n_samples, n_features) and a target vector of shape (n_samples,).

4. Fit the model to your data by calling the fit() method of the model instance.

5. Apply the model to new data:
    * For supervised learning, often we predict labels for unknown data using the predict() method.
    * For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.
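The five steps above can be sketched end to end on a tiny made-up dataset. This is only an illustrative outline: the values of X and y and the name X_new are assumptions chosen for the example, not part of the walkthrough that follows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # Step 1: choose a model class

# Made-up data for illustration: y is exactly 3 * x + 1
X = np.arange(10, dtype=float).reshape(-1, 1)  # Step 3: features matrix, shape (10, 1)
y = 3 * X.ravel() + 1                          # Step 3: target vector, shape (10,)

model = LinearRegression()  # Step 2: instantiate with default hyperparameters
model.fit(X, y)             # Step 4: fit the model to the data

# Step 5: apply the model to new, unseen inputs
X_new = np.array([[10.0], [11.0]])
y_new = model.predict(X_new)
print(y_new)
```

Because the made-up data has no noise, the predictions should land very close to 3 * 10 + 1 and 3 * 11 + 1.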


## Example using Fake Data

### Step 0. Generating Synthetic Data

This is a fancy way of saying fake data. 


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Generate the synthetic data
np.random.seed(0)  # this seed allows us to reproduce the data across machines
X = 2.5 * np.random.randn(1000) + 1.5   # Array of 1000 values with mean = 1.5, stddev = 2.5
y = X * 2 + np.random.randn(1000) * 2  # Actual values of Y

# Splitting the data into training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Visualizing the generated data
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.scatter(X_test, y_test, color='green', label='Testing data')
plt.title("Synthetic Linear Data")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()


### Step 1. Choosing a Model Class
For a linear regression problem, we will use the LinearRegression estimator class from Scikit-Learn's linear_model module.

In [None]:
#Step 1
from sklearn.linear_model import LinearRegression


### Step 2. Choosing Model Hyperparameters
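When instantiating the LinearRegression class, we can specify hyperparameters such as fit_intercept. For our simple linear regression, we'll use the default settings, which fit an intercept. Note that LinearRegression is ordinary least squares and has no regularization option; regularized variants such as Ridge and Lasso are separate estimator classes.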

In [None]:
#Step 2 - choose hyperparameters
model = LinearRegression()
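
One way to see which hyperparameters an estimator exposes, and what their defaults are, is the get_params() method that Scikit-Learn estimators provide. The sketch below uses the hypothetical names default_model and no_intercept_model so it does not interfere with the model instance above. The exact set of keys can vary between Scikit-Learn versions.

```python
from sklearn.linear_model import LinearRegression

# Inspect the hyperparameters of an instance created with default settings
default_model = LinearRegression()
params = default_model.get_params()
print(params)  # dictionary mapping hyperparameter names to current values

# Override one hyperparameter at instantiation time:
# fit_intercept=False forces the fitted line through the origin
no_intercept_model = LinearRegression(fit_intercept=False)
print(no_intercept_model.get_params()["fit_intercept"])
```

Hyperparameters are set before fitting; values learned from the data (such as the slope) only appear after fit() is called.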


### Step 3. Arranging Data into a Features Matrix and Target Vector
Our data is already split into a features matrix X and a target vector y, as required by Scikit-Learn's API. We need to ensure X is in the correct shape (a two-dimensional array), especially when dealing with a single feature.

In [None]:
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
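
To make the effect of reshape(-1, 1) concrete, here is a minimal sketch on a small made-up array (the name x_1d and its values are assumptions for illustration). The -1 tells NumPy to infer the number of rows, and the 1 asks for a single column.

```python
import numpy as np

x_1d = np.array([1.0, 2.0, 3.0, 4.0])  # made-up values, one feature
print(x_1d.shape)   # (4,)   -> one-dimensional: 4 values, no column axis

x_2d = x_1d.reshape(-1, 1)  # -1: infer the row count, 1: one column
print(x_2d.shape)   # (4, 1) -> 4 samples, 1 feature

# The values themselves are unchanged; only the layout differs
print(np.array_equal(x_2d.ravel(), x_1d))  # True
```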


### Step 4. Fitting the Model to Your Data
Now, we train our model on the training data using the fit() method.

In [None]:
model.fit(X_train, y_train)


### Step 5. Applying the Model to New Data
Finally, we can make predictions using our trained model. For supervised learning tasks like ours, we use the predict() method.

In [None]:
y_pred = model.predict(X_test)

# Visualizing the model's predictions alongside the original data
plt.scatter(X_test, y_test, color='green', label='Testing data')
plt.plot(X_test, y_pred, color='red', label='Model Prediction')
plt.title("Model Predictions vs. Testing Data")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()
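
For a single-feature linear regression, predict() amounts to evaluating slope * x + intercept. The sketch below checks this on separate made-up data. The names X_demo, y_demo, and demo_model are assumptions for illustration, kept distinct from the notebook's variables.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data with a roughly linear relationship
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 1))
y_demo = 2 * X_demo.ravel() + rng.normal(size=50)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Evaluate the fitted line by hand and compare with predict()
manual_pred = demo_model.coef_[0] * X_demo.ravel() + demo_model.intercept_
sklearn_pred = demo_model.predict(X_demo)
print(np.allclose(manual_pred, sklearn_pred))  # True
```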


### Getting more results

After fitting the model to your training data, you can access the model's slope (coefficient) and intercept directly via the .coef_ and .intercept_ attributes. To evaluate the performance of your model, particularly how well it generalizes to unseen data, you can use the .score() method to compute the $R^{2}$ score on the test dataset.

In [None]:
# The code below assumes `model` is your LinearRegression instance and that you've already fit
# it to your training data. Later we will change the name of our model instance, so watch for this.

# Extracting the coefficient and intercept
coefficient = model.coef_
intercept = model.intercept_

# Calculating the R^2 score
# (stored as r2 rather than r2_score so it does not shadow sklearn.metrics.r2_score, which is imported later)
r2 = model.score(X_test, y_test)

print(f"Coefficient (Slope): {coefficient[0]}")
print(f"Intercept: {intercept}")
print(f"R^2 Score: {r2}")
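
For regressors, .score() returns the coefficient of determination, $R^{2} = 1 - SS_{res} / SS_{tot}$. As a rough check, the sketch below regenerates the same synthetic data as Step 0 and computes that quantity by hand. The names ss_res, ss_tot, and r2_manual are introduced here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Same recipe as Step 0, with X reshaped to 2D up front
np.random.seed(0)
X = (2.5 * np.random.randn(1000) + 1.5).reshape(-1, 1)
y = X.ravel() * 2 + np.random.randn(1000) * 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"Manual R^2:  {r2_manual:.4f}")
print(f".score() R^2: {model.score(X_test, y_test):.4f}")
```

The two numbers should agree up to floating-point error.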


## Learning about data shape

### Example
When working with a pandas DataFrame and selecting predictor variables for a regression model, you may need to reshape your data to fit the expected input format for Scikit-Learn models. Typically, Scikit-Learn expects the features (X) to be a two-dimensional array (matrix) of shape (n_samples, n_features) and the target (y) to be a one-dimensional array of shape (n_samples,). Here's how you can use the .reshape() method and other techniques to prepare your data correctly:

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Example DataFrame
df = pd.DataFrame({
    'feature1': np.random.rand(100),
    'target': np.random.rand(100)
})

# Selecting the predictor and target variables
#X = df[['feature1']]  # This keeps X as a DataFrame, which is already 2D.
#y = df['target']

# Alternatively, if you select the column as a Series, reshape is required:
X = df['feature1'].values.reshape(-1, 1)  # Reshape to 2D array
y = df['target'].values  # y can stay as 1D array

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:
y

### Scenario 1: Single Predictor Variable
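If your DataFrame has a single predictor variable, selecting this column with single brackets (df['feature1']) returns a one-dimensional pandas Series. You need to either reshape its values into a two-dimensional array or select the column with double brackets so that it stays a DataFrame.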

Suppose you have a DataFrame df with a column 'feature1' as your predictor and 'target' as your target variable.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Example DataFrame
df = pd.DataFrame({
    'feature1': np.random.rand(100),
    'target': np.random.rand(100)
})

# Selecting the predictor and target variables
X = df[['feature1']]  # This keeps X as a DataFrame, which is already 2D.
y = df['target']

# Alternatively, if you select the column as a Series, reshape is required:
#X = df['feature1'].values.reshape(-1, 1)  # Reshape to 2D array
#y = df['target'].values  # y can stay as 1D array

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


Note: Using df[['feature1']] (note the double brackets) keeps X as a DataFrame, which is inherently two-dimensional. If you use df['feature1'].values or df['feature1'].to_numpy(), it returns a one-dimensional NumPy array, hence the need for .reshape(-1, 1) to convert it into a 2D array.
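
The difference between these selections is easiest to see by printing types and shapes side by side. This is a small sketch on a made-up five-row DataFrame; demo_df and the as_* names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({"feature1": np.arange(5, dtype=float)})

as_series = demo_df["feature1"]            # single brackets -> Series (1D)
as_frame = demo_df[["feature1"]]           # double brackets -> DataFrame (2D)
as_array = demo_df["feature1"].to_numpy()  # 1D NumPy array
as_2d_array = as_array.reshape(-1, 1)      # 2D NumPy array with one column

print(type(as_series).__name__, as_series.shape)      # Series (5,)
print(type(as_frame).__name__, as_frame.shape)        # DataFrame (5, 1)
print(type(as_array).__name__, as_array.shape)        # ndarray (5,)
print(type(as_2d_array).__name__, as_2d_array.shape)  # ndarray (5, 1)
```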

### Scenario 2: Multiple Predictor Variables
If you're selecting multiple predictor variables, pandas will keep the data in a two-dimensional structure, which is what Scikit-Learn expects.

In [None]:
# Set seed for reproducibility
np.random.seed(1953)

# Generate synthetic data
n_samples = 100
feature1 = np.random.rand(n_samples) * 10  # Feature 1: Random values scaled up to 10
feature2 = np.random.rand(n_samples) * 20  # Feature 2: Random values scaled up to 20
feature3 = np.random.rand(n_samples) * 5   # Feature 3: Random values scaled up to 5

# Create a target variable with a linear combination of features plus some noise
target = 3 * feature1 + 2 * feature2 - 4 * feature3 + np.random.randn(n_samples) * 3

# Create a DataFrame
df = pd.DataFrame({
    'feature1': feature1,
    'feature2': feature2,
    'feature3': feature3,
    'target': target
})

# the first 5 rows 
print(df.head())


In [None]:
# Assuming 'feature1', 'feature2', 'feature3' are your predictors
X = df[['feature1', 'feature2', 'feature3']]
y = df['target'].values

# Splitting the data into training and testing sets remains the same
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In this case, X is already in the correct shape because selecting multiple columns from a DataFrame results in another DataFrame.
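
As a follow-up sketch, a LinearRegression can be fit on these three predictors, and its coefficients compared with the values used to build the target (3, 2, and -4). The code rebuilds the DataFrame with the same recipe so it runs on its own; the name multi_model is an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Rebuild the synthetic DataFrame with the same recipe as above
np.random.seed(1953)
n_samples = 100
df = pd.DataFrame({
    "feature1": np.random.rand(n_samples) * 10,
    "feature2": np.random.rand(n_samples) * 20,
    "feature3": np.random.rand(n_samples) * 5,
})
df["target"] = (3 * df["feature1"] + 2 * df["feature2"] - 4 * df["feature3"]
                + np.random.randn(n_samples) * 3)

X = df[["feature1", "feature2", "feature3"]]
y = df["target"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

multi_model = LinearRegression().fit(X_train, y_train)

# One coefficient per column of X, in column order
for name, coef in zip(X.columns, multi_model.coef_):
    print(f"{name}: {coef:.2f}")
print(f"Test R^2: {multi_model.score(X_test, y_test):.2f}")
```

With multiple predictors, .coef_ holds one value per feature. Because of the added noise, the estimates should be near, but not exactly equal to, the generating values.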

### Pro tip
When you encounter an error related to the shape of your data, especially with Scikit-Learn, it often helps to check the dimensions of your arrays with .shape:

In [None]:
print(X.shape)  # Should be (n_samples, n_features)
print(y.shape)  # Should be (n_samples,)
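
A typical shape error looks like the following sketch: passing a one-dimensional X to fit() raises a ValueError, and reshaping resolves it. The names x_1d, y_demo, raised, and fixed_model are assumptions for illustration, and the exact wording of the error message may differ between Scikit-Learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x_1d = np.arange(10, dtype=float)  # shape (10,), not (10, 1)
y_demo = 2 * x_1d + 1

try:
    LinearRegression().fit(x_1d, y_demo)  # X is 1D, so this should fail
    raised = False
except ValueError as err:
    raised = True
    print(str(err).splitlines()[0])  # first line of the error message

# The fix: give X an explicit feature axis
fixed_model = LinearRegression().fit(x_1d.reshape(-1, 1), y_demo)
print(fixed_model.coef_, fixed_model.intercept_)
```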


## More on Splitting data

Splitting data into training and testing sets is a fundamental practice in machine learning to evaluate the performance of a model. It helps in understanding how well the model generalizes to unseen data. To demonstrate its value, let's create an example using Scikit-Learn and Python where we compare the performance of a model on the training set versus the testing set.

We'll use a simple linear regression model with synthetic data. 
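
Before the full example, here is a small sketch of what train_test_split itself does, using a made-up array of 20 rows (X_demo, y_demo, and the shortened names are assumptions for illustration). It shows the resulting sizes, that a fixed random_state reproduces the same split, and that rows of X and y stay paired.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples where y equals the single feature, so pairing is easy to check
X_demo = np.arange(20).reshape(-1, 1)
y_demo = np.arange(20)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)  # (16, 1) (4, 1)

# The same random_state gives the same split again
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te_again))  # True

# Rows are shuffled, but each X row keeps its matching y value
print(np.array_equal(X_te.ravel(), y_te))  # True
```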

### Example

#### Generating Synthetic Data
First, we generate synthetic data that has a linear relationship, but with added noise to simulate real-world data imperfections.

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split

# Step 1: Choose a model class
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Set seed for reproducibility
np.random.seed(42)

# Generate synthetic data
X = np.random.rand(100, 1) * 10  # 100 data points in the range 0-10
y = 2 * X.squeeze() + 1 + np.random.randn(100) * 2  # Linear relation with noise

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


#### Training the Model
Next, we train a linear regression model on the training set and evaluate its performance both on the training set and the testing set.

In [None]:
# Step 2: choose the model hyperparameters
model = LinearRegression()

In [None]:
# Step 3: Arrange the data
print(X.shape)  # Should be (n_samples, n_features)
print(y.shape)  # Should be (n_samples,)

In [None]:
# Step 4: Fit the model
model.fit(X_train, y_train)

# Predictions
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Performance evaluation
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f"Training MSE: {train_mse:.2f}, Training R^2: {train_r2:.2f}")
print(f"Testing MSE: {test_mse:.2f}, Testing R^2: {test_r2:.2f}")


The mean squared error (MSE) and $R^{2}$ score on the training and testing sets will give us an idea of how well the model performs. A significant difference in performance between the training and testing sets can indicate overfitting: the model performs well on the training data but fails to generalize to new, unseen data. Splitting the data into training and testing sets helps us detect this issue and take steps to address it, such as simplifying the model, using regularization techniques, or gathering more data.
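
For reference, the MSE is simply the average of the squared differences between actual and predicted values. The sketch below computes it by hand on four made-up numbers and compares the result with mean_squared_error; y_true and y_hat are assumed values for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])   # made-up actual values
y_hat = np.array([2.5, 5.5, 7.0, 10.0])   # made-up predictions

# Errors are 0.5, -0.5, 0.0, -1.0; squared: 0.25, 0.25, 0.0, 1.0
mse_manual = np.mean((y_true - y_hat) ** 2)
mse_sklearn = mean_squared_error(y_true, y_hat)

print(mse_manual)   # 0.375
print(mse_sklearn)  # 0.375
```

Because the errors are squared, the MSE is in squared units of y, and large misses are penalized more heavily than small ones.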

In [None]:
plt.figure(figsize=(10, 5))

# Plot training data
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, color='blue', label='Training data')
plt.plot(X_train, y_train_pred, color='red', label='Model Prediction')
plt.title("Model Fit on Training Data")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()

# Plot testing data
plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, color='green', label='Testing data')
plt.plot(X_test, y_test_pred, color='red', label='Model Prediction')
plt.title("Model Prediction on Testing Data")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()

plt.tight_layout()
plt.show()


### An overfit example

We will again use some synthetic (fake) data. We are going to fit two models to it: a simple one and another with excessive complexity.

Overfitting occurs when a model learns the training data **too well**, capturing noise as if it were a true pattern, which negatively impacts its performance on new, unseen data.

We'll use polynomial regression as our example. A simple linear regression model will serve as the simple baseline, and a high-degree polynomial regression model will serve as the overly complex model. Note that the data below is generated from a quadratic function, so the straight line is, if anything, too simple; it is included for contrast. We'll use Scikit-Learn for model fitting and matplotlib for visualization.

Don't get lost in the model; this is just to demonstrate overfitting.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

# Generate synthetic data
np.random.seed(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X.squeeze() ** 2 + np.random.randn(100) * 1.5 + 2

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)


In [None]:
# Very simple: Linear regression model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)

# Very complex: High-degree polynomial regression model
poly_model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
poly_model.fit(X_train, y_train)


In [None]:
# Predictions
y_pred_train_linear = linear_model.predict(X_train)
y_pred_test_linear = linear_model.predict(X_test)

y_pred_train_poly = poly_model.predict(X_train)
y_pred_test_poly = poly_model.predict(X_test)

# MSE calculations
mse_train_linear = mean_squared_error(y_train, y_pred_train_linear)
mse_test_linear = mean_squared_error(y_test, y_pred_test_linear)

mse_train_poly = mean_squared_error(y_train, y_pred_train_poly)
mse_test_poly = mean_squared_error(y_test, y_pred_test_poly)

# Display MSE
print(f"Linear Regression - Training MSE: {mse_train_linear:.2f}, Testing MSE: {mse_test_linear:.2f}")
print(f"Polynomial Regression - Training MSE: {mse_train_poly:.2f}, Testing MSE: {mse_test_poly:.2f}")

# Plotting
plt.figure(figsize=(14, 6))

# Linear model
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, color='lightgray', label='Training data')
plt.scatter(X_test, y_test, color='gold', label='Testing data')
plt.plot(np.sort(X_train.squeeze()), linear_model.predict(np.sort(X_train, axis=0)), color='red', label='Linear Model')
plt.title("Linear Regression Fit")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()

# Polynomial model
plt.subplot(1, 2, 2)
X_fit = np.linspace(-3, 3, 100).reshape(-1, 1)
plt.scatter(X_train, y_train, color='lightgray', label='Training data')
plt.scatter(X_test, y_test, color='gold', label='Testing data')
plt.plot(X_fit, poly_model.predict(X_fit), color='blue', label='Polynomial Model')
plt.title("Polynomial Regression Fit")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()

plt.tight_layout()
plt.show()
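
To compare more than two levels of complexity, one option is to sweep the polynomial degree and record training and testing MSE for each. The sketch below regenerates the same data and tries degrees 1, 2, and 15. Degree 2 matches the quadratic used to generate y. The names results and sweep_model are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Same synthetic data and split as in the overfitting example
np.random.seed(0)
X = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * X.squeeze() ** 2 + np.random.randn(100) * 1.5 + 2
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

results = {}  # degree -> (training MSE, testing MSE)
for degree in (1, 2, 15):
    sweep_model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    sweep_model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, sweep_model.predict(X_train))
    test_mse = mean_squared_error(y_test, sweep_model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  Training MSE: {train_mse:.2f}  Testing MSE: {test_mse:.2f}")
```

Training MSE tends to fall as the degree grows, since a more flexible model can always fit the training points at least as closely. The testing MSE is the number that shows whether the extra flexibility helped on unseen data.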
