<a href="https://colab.research.google.com/github/Nikhil90398/NR_Playstore_Data_Analysis_Team_Kaggle/blob/main/REGRESSION_BLUEPRINT-01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Topic using Linear regression

The following topics are covered in this Colab:

- A typical problem statement for machine learning
- Downloading and exploring a dataset for machine learning
- Linear regression with one variable using Scikit-learn
- Linear regression with multiple variables 
- Using categorical features for machine learning
- Regression coefficients and feature importance
- Other models and techniques for regression using Scikit-learn
- Applying linear regression to other datasets


# Problem Statement

This tutorial takes a practical and coding-focused approach. We'll define the terms _machine learning_ and _linear regression_ in the context of a problem, and later generalize their definitions. We'll work through a typical machine learning problem step-by-step:


> **QUESTION**: ACME Insurance Inc. offers affordable health insurance to thousands of customers all over the United States. As the lead data scientist at ACME, **you're tasked with creating an automated system to estimate the annual medical expenditure for new customers**, using information such as their age, sex, BMI, children, smoking habits and region of residence. 
>
> Estimates from your system will be used to determine the annual insurance premium (amount paid every month) offered to the customer. Due to regulatory requirements, you must be able to explain why your system outputs a certain prediction.

# Step 1 - Download and Explore the Data

The dataset is available as a ZIP file at the following url:

In [None]:
dataset_url = 

In [None]:
!pip install opendatasets

In [None]:
import opendatasets as od
od.download(dataset_url)

Skipping, found downloaded files in "./calcofi" (use force=True to force download)


In [None]:
import os
data_dir='/content/calcofi'
os.listdir(data_dir)

['bottle.csv', 'cast.csv']

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
path= data_dir +".......csv"



> **QUESTION 1**: Load the data from the file `train.csv` into a Pandas data frame.

In [None]:
df=pd.read_csv(path)

In [None]:
df

The dataset contains 1338 rows and 7 columns. Each row of the dataset contains information about one customer. 

Our objective is to find a way to estimate the value in the "charges" column using the values in the other columns. If we can do so for the historical data, then we should be able to estimate charges for new customers too, simply by asking for information like their age, sex, BMI, number of children, smoking habits and region.

Let's check the data type for each column.

In [None]:
df.info()

> **QUESTION 2**: How many rows and columns does the dataset contain? 

In [None]:
n_rows = df.shape[0]

In [None]:
n_cols = df.shape[1]

In [None]:
print('The dataset contains {} rows and {} columns.'.format(n_rows, n_cols))

## Exploratory Analysis and Visualization

Let's explore the data by visualizing the distribution of values in some columns of the dataset, and the relationships between "charges" and other columns.


* Libraries that we are going to use in this Colab

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns
%matplotlib inline
# The following settings will improve the default style and font sizes for our charts
sns.set_style('darkgrid')
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (10, 6)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

> **QUESTION 3**: What percentage of the values in each column is missing? 

In [None]:
# Here we will check the percentage of NaN values present in each feature
# Step 1: make a list of the features that have at least one missing value
feature_with_na = [feature for feature in df.columns if df[feature].isnull().sum() > 0]
# Step 2: print the feature name and the percentage of missing values
for feature in feature_with_na:
  print(feature, np.round(df[feature].isnull().mean() * 100, 2), "% missing values")

In [None]:
# Let's drop columns in which more than 40% of the values are NaN
perc = 40.0
# Minimum number of non-missing values a column needs in order to be kept
min_count = int(((100 - perc) / 100) * df.shape[0] + 1)
df = df.dropna(axis=1, thresh=min_count)
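
The threshold arithmetic above is easier to follow on a tiny example. The sketch below uses a made-up four-row frame (the column names `a`, `b`, `c` and their values are hypothetical) to show which columns survive `dropna(axis=1, thresh=...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "b" is 75% missing, "c" is 25% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

perc = 40.0
# Minimum number of non-missing values a column needs in order to be kept.
min_count = int(((100 - perc) / 100) * toy.shape[0] + 1)

# thresh counts NON-missing values, so sparse columns are the ones dropped.
kept = toy.dropna(axis=1, thresh=min_count)
print(min_count, list(kept.columns))
```

Note that `thresh` is expressed in terms of values that are present, not values that are missing, which is why the percentage is subtracted from 100 first.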

* list of numerical variables

In [None]:
# List of numerical variables (every column whose dtype is not object)
numerical_features = [feature for feature in df.columns if df[feature].dtypes != "O"]
print("number of numerical variables: ", len(numerical_features))
# Preview the numerical columns
df[numerical_features].head()

* Discrete Variables

In [None]:
## Numerical variables are usually of 2 types
## 1 Discrete variables (here: fewer than 25 distinct values)
discrete_feature = [feature for feature in numerical_features if len(df[feature].unique()) < 25]
print("Discrete Variables Count: {}".format(len(discrete_feature)))

Discrete Variables Count: 13


In [None]:
for feature in discrete_feature:
  data=df.copy()
  data.groupby(feature)["T_degC"].sum().plot.bar()
  plt.xlabel(feature)
  plt.ylabel('temperature')
  plt.show()

* Continuous variables

In [None]:
# 2 Continuous variables
continuous_feature=[feature for feature in numerical_features if feature not in discrete_feature]
print("Continuous Variables Count {}".format(len(continuous_feature)))

In [None]:
for feature in continuous_feature:
    data=df.copy()
    data[feature].hist(bins=25)
    plt.xlabel(feature)
    plt.ylabel("Count")
    plt.title(feature)
    plt.show()

# Step 2 - Prepare the Dataset for Training


Before we can train the model, we need to prepare the dataset. Here are the steps we'll follow:

1. Identify the input and target column(s) for training the model.
2. Identify numeric and categorical input columns.
3. [Impute](https://scikit-learn.org/stable/modules/impute.html) (fill) missing values in numeric columns
4. [Scale](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range) values in numeric columns to a $(0,1)$ range.
5. [Encode](https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features) categorical data into one-hot vectors.
6. Split the dataset into training and validation sets.


## Identify Inputs and Targets

While the dataset contains `81` columns, not all of them are useful for modeling. Note the following:

- The first column `Id` is a unique ID for each house and isn't useful for training the model.
- The last column `SalePrice` contains the value we need to predict i.e. it's the target column.
- Data from all the other columns (except the first and the last column) can be used as inputs to the model.
 

> **QUESTION 4**: Create a list `input_cols` of column names containing data that can be used as input to train the model, and identify the target column as the variable `target_col`.

In [None]:
# Identify the input columns (a list of column names)
input_cols = list(df.columns)[1:-1]

In [None]:
# Identify the name of the target column (a single string, not a list)
target_col = list(df.columns)[-1]

In [None]:
print(target_col)

Make sure that the `Id` and `SalePrice` columns are not included in `input_cols`.

Now that we've identified the input and target columns, we can separate input & target data.

In [None]:
# Copy the inputs so that later in-place changes don't trigger SettingWithCopyWarning
inputs_df = df[input_cols].copy()
targets = df[target_col]

## Identify Numeric and Categorical Data
The next step in data preparation is to identify numeric and categorical columns. We can do this by looking at the data type of each column.

> **QUESTION 5**: Create two lists `numeric_cols` and `categorical_cols` containing names of numeric and categorical input columns within the dataframe respectively. Numeric columns have data types `int64` and `float64`, whereas categorical columns have the data type `object`.
>
> *Hint*: See this [StackOverflow question](https://stackoverflow.com/questions/25039626/how-do-i-find-numeric-columns-in-pandas). 

In [None]:
# Numeric columns have integer or float dtypes; categorical columns have dtype object
numeric_cols = inputs_df.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = inputs_df.select_dtypes(include=['object']).columns.tolist()

## Impute Numerical Data
Some of the numeric columns in our dataset contain missing values (NaN).

In [None]:
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

Machine learning models can't work with missing data. The process of filling missing values is called [imputation](https://scikit-learn.org/stable/modules/impute.html).

<img src="https://i.imgur.com/W7cfyOp.png" width="480">

There are several techniques for imputation, but we'll use the most basic one: replacing missing values with the average value in the column using the `SimpleImputer` class from `sklearn.impute`.
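
As a rough illustration of what mean imputation does, here is a minimal sketch on a made-up two-column frame. It uses plain pandas (`mean` and `fillna`) rather than `SimpleImputer`, purely to make the arithmetic visible; the column names `x` and `y` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [10.0, 20.0, np.nan]})

# Column means are computed over the values that are present (NaN is skipped).
col_means = toy.mean()

# Each missing value is replaced by the mean of its own column.
filled = toy.fillna(col_means)
print(filled)
```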


> **QUESTION 6**: Impute (fill) missing values in the numeric columns of `inputs_df` using a `SimpleImputer`. 

In [None]:
from sklearn.impute import SimpleImputer
# 1. Create the imputer
imputer = SimpleImputer(strategy = 'mean')

# 2. Fit the imputer to the numeric columns
imputer.fit(inputs_df[numeric_cols])

# 3. Transform and replace the numeric columns
inputs_df[numeric_cols] = imputer.transform(inputs_df[numeric_cols])

In [None]:
missing_counts = inputs_df[numeric_cols].isna().sum().sort_values(ascending=False)
missing_counts[missing_counts > 0]

## Scale Numerical Values
The numeric columns in our dataset have varying ranges.

In [None]:
inputs_df[numeric_cols].describe().loc[['min', 'max']]

A good practice is to [scale numeric features](https://scikit-learn.org/stable/modules/preprocessing.html#scaling-features-to-a-range) to a small range of values e.g. $(0,1)$. Scaling numeric features ensures that no particular feature has a disproportionate impact on the model's loss. Optimization algorithms also work better in practice with smaller numbers.
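
For intuition, min-max scaling maps each value $x$ to $(x - x_{min}) / (x_{max} - x_{min})$, so the smallest value in a column becomes 0 and the largest becomes 1. The sketch below applies that formula by hand with NumPy to a few made-up numbers.

```python
import numpy as np

# Hypothetical values for a single numeric column.
x = np.array([18.0, 30.0, 64.0])

# Subtract the column minimum, then divide by the column range.
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)
```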


> **QUESTION 7**: Scale numeric values to the $(0, 1)$ range using `MinMaxScaler` from `sklearn.preprocessing`.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Create the scaler
scaler = MinMaxScaler()

# Fit the scaler to the numeric columns
scaler.fit(inputs_df[numeric_cols])

# Transform and replace the numeric columns
inputs_df[numeric_cols] = scaler.transform(inputs_df[numeric_cols])

After scaling, the ranges of all numeric columns should be (0, 1).

In [None]:
inputs_df[numeric_cols].describe().loc[['min', 'max']]

## Encode Categorical Columns
Our dataset contains several categorical columns, each with a different number of categories.

In [None]:
inputs_df[categorical_cols].nunique().sort_values(ascending=False)



Since machine learning models can only be trained with numeric data, we need to convert categorical data to numbers. A common technique is to use one-hot encoding for categorical columns.

<img src="https://i.imgur.com/n8GuiOO.png" width="640">

One hot encoding involves adding a new binary (0/1) column for each unique category of a categorical column.
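
To see what the resulting columns look like, here is a small sketch on a made-up `region` column. It uses `pd.get_dummies` as a quick stand-in for illustration only; the values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with two distinct categories.
toy = pd.DataFrame({"region": ["south", "north", "south"]})

# One new 0/1 column is created per category, named <column>_<category>.
one_hot = pd.get_dummies(toy, columns=["region"], dtype=int)
print(one_hot)
```

Each row has exactly one 1 across the new columns, marking the category that row belongs to.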

> **QUESTION 8**: Encode categorical columns in the dataset as one-hot vectors using `OneHotEncoder` from `sklearn.preprocessing`. Add a new binary (0/1) column for each category

In [None]:
from sklearn.preprocessing import OneHotEncoder

# 1. Create the encoder
# (scikit-learn >= 1.2 uses `sparse_output`; older versions use `sparse=False`)
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

# 2. Fit the encoder to the categorical columns
encoder.fit(inputs_df[categorical_cols])

# 3. Generate column names for each category
encoded_cols = list(encoder.get_feature_names_out(categorical_cols))
len(encoded_cols)

In [None]:
# 4. Transform and add new one-hot category columns
inputs_df[encoded_cols] = encoder.transform(inputs_df[categorical_cols])

The new one-hot category columns should now be added to `inputs_df`.

## Training and Validation Set
Finally, let's split the dataset into a training and validation set. We'll use a randomly selected 25% subset of the data for validation. Also, we'll use just the numeric and encoded columns, since the inputs to our model must be numbers.

In [None]:
from sklearn.model_selection import train_test_split
train_inputs, val_inputs, train_targets, val_targets = train_test_split(inputs_df[numeric_cols + encoded_cols], 
                                                                        targets, 
                                                                        test_size=0.25, 
                                                                        random_state=42)


In [None]:
train_inputs

In [None]:
train_targets

In [None]:
val_inputs

In [None]:
val_targets

# Step 3 - Train a Linear Regression Model

We're now ready to train the model. Linear regression is a commonly used technique for solving [regression problems](https://jovian.ai/aakashns/python-sklearn-logistic-regression/v/66#C6). In a linear regression model, the target is modeled as a linear combination (or weighted sum) of input features. The predictions from the model are evaluated using a loss function like the Root Mean Squared Error (RMSE).
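
To make the weighted-sum idea concrete, here is a small NumPy sketch. The inputs, weights, bias and targets are made-up numbers chosen for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical inputs: 3 rows, 2 features.
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([2.0, -1.0])   # one weight per feature
b = 0.5                     # bias (intercept)

# Prediction = weighted sum of the features plus the bias.
preds = X @ w + b

# RMSE = square root of the mean of the squared differences.
targets = np.array([1.5, 2.5, 3.5])
rmse = np.sqrt(np.mean((preds - targets) ** 2))
print(preds, rmse)
```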


Here's a visual summary of how a linear regression model is structured:

<img src="https://i.imgur.com/iTM2s5k.png" width="480">

However, linear regression doesn't generalize very well when we have a large number of input columns with collinearity, i.e. when the values in one column are highly correlated with values in other column(s). With nothing to constrain the weights, the model can assign large, offsetting weights to correlated columns in order to fit the training data as closely as possible.

Instead, we'll use Ridge Regression, a variant of linear regression that uses a technique called L2 regularization to introduce another loss term that forces the model to generalize better. Learn more about ridge regression here: https://www.youtube.com/watch?v=Q81RR3yKn30
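
The following is a minimal sketch of the effect of L2 regularization, assuming synthetic data with two nearly identical columns and no intercept term. `ridge_weights` is a hypothetical helper that solves the regularized normal equations $(X^TX + \alpha I)w = X^Ty$ directly; it is meant to illustrate the idea and is not a description of how scikit-learn's `Ridge` works internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: x2 is almost an exact copy of x1 (strong collinearity).
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=50)

def ridge_weights(X, y, alpha):
    """Solve (X^T X + alpha * I) w = X^T y for w (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_ols = ridge_weights(X, y, alpha=0.0)    # no penalty: ordinary least squares
w_ridge = ridge_weights(X, y, alpha=1.0)  # L2 penalty on the weights

print("unregularized:", w_ols, "norm:", np.linalg.norm(w_ols))
print("ridge:        ", w_ridge, "norm:", np.linalg.norm(w_ridge))
```

The penalty term never increases the size of the weight vector, which is what keeps correlated columns from receiving large, offsetting weights.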

> **QUESTION 9**: Create and train a linear regression model using the `Ridge` class from `sklearn.linear_model`.

In [None]:
from sklearn.linear_model import Ridge

# Create the model
model = Ridge(alpha=1)

# Fit the model using inputs and targets
model.fit(train_inputs,train_targets)

`model.fit` trains the model. Conceptually, training follows this strategy:

1. We initialize a model with random parameters (weights & biases).
2. We pass some inputs into the model to obtain predictions.
3. We compare the model's predictions with the actual targets using the loss function.
4. We use an optimization technique (like least squares, gradient descent etc.) to reduce the loss by adjusting the weights & biases of the model.
5. We repeat steps 2 to 4 till the predictions from the model are good enough.

<img src="https://www.deepnetts.com/blog/wp-content/uploads/2019/02/SupervisedLearning.png" width="480">
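
The loop described above can be sketched with plain gradient descent on synthetic data. This is an illustration of the general strategy under assumed values (the learning rate, step count and data are all made up), not the procedure scikit-learn runs inside `model.fit`.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, noise-free data generated from known weights and bias.
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
true_b = 4.0
y = X @ true_w + true_b

# Step 1: start from random parameters.
w = rng.normal(size=3)
b = 0.0
learning_rate = 0.1

for step in range(500):
    # Step 2: pass the inputs through the model to get predictions.
    preds = X @ w + b
    # Step 3: compare predictions with targets (mean squared error).
    error = preds - y
    loss = np.mean(error ** 2)
    # Step 4: adjust weights and bias in the direction that reduces the loss.
    grad_w = 2 * X.T @ error / len(y)
    grad_b = 2 * error.mean()
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
    # Step 5: the loop repeats steps 2 to 4.

print(w, b, loss)
```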

# Step 4 - Make Predictions and Evaluate Your Model

The model is now trained, and we can use it to generate predictions for the training and validation inputs. We can evaluate the model's performance using the RMSE (root mean squared error) loss function.

> **QUESTION 10**: Generate predictions and compute the RMSE loss for the training and validation sets. 
> 
> *Hint*: Take the square root of `mean_squared_error` to compute the RMSE loss. Older scikit-learn versions also accept the argument `squared=False`, which is deprecated in recent releases.

In [None]:
train_preds =model.predict(train_inputs)

In [None]:
train_preds

In [None]:
from sklearn.metrics import mean_squared_error,r2_score
# Root mean square error
train_rmse = np.sqrt(mean_squared_error(train_targets, train_preds))
print('The RMSE loss for the training set is {}.'.format(train_rmse))
# r2_score
train_r2=r2_score(train_targets,train_preds)
print('The r2_score for the training set is {}.'.format(train_r2))

In [None]:
val_preds = model.predict(val_inputs)

In [None]:
val_preds

In [None]:
val_rmse = np.sqrt(mean_squared_error(val_targets, val_preds))
print('The RMSE loss for the validation set is {}.'.format(val_rmse))
val_r2 = r2_score(val_targets, val_preds)
print('The r2_score for the validation set is {}.'.format(val_r2))

## Feature Importance

Let's look at the weights assigned to different columns, to figure out which columns in the dataset are the most important.

> **QUESTION 11**: Identify the weights (or coefficients) assigned to different features by the model.
> 
> *Hint:* Read [the docs](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

In [None]:
weights = model.coef_.flatten()

Let's create a dataframe to view the weight assigned to each column.

In [None]:
weights_df = pd.DataFrame({
    'columns': train_inputs.columns,
    'weight': weights
}).sort_values('weight', ascending=False)

In [None]:
weights_df

## Making Predictions
The model can be used to make predictions on new inputs using the following helper function:

In [None]:
def predict_input(single_input):
    input_df = pd.DataFrame([single_input])
    input_df[numeric_cols] = imputer.transform(input_df[numeric_cols])
    input_df[numeric_cols] = scaler.transform(input_df[numeric_cols])
    input_df[encoded_cols] = encoder.transform(input_df[categorical_cols])
    X_input = input_df[numeric_cols + encoded_cols]
    return model.predict(X_input)[0]

In [None]:
sample_input = {
    

}

In [None]:
len(sample_input)

In [None]:
predicted_price = predict_input(sample_input)

## Saving the model
Let's save the model (along with other useful objects) to disk, so that we can use it for making predictions without retraining.

In [None]:
import joblib

In [None]:
house_price_predictor = {
    'model': model,
    'imputer': imputer,
    'scaler': scaler,
    'encoder': encoder,
    'input_cols': input_cols,
    'target_col': target_col,
    'numeric_cols': numeric_cols,
    'categorical_cols': categorical_cols,
    'encoded_cols': encoded_cols
}

In [None]:
joblib.dump(house_price_predictor, 'house_price_predictor.joblib')
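
To use the saved objects later, the file can be read back with `joblib.load`. The sketch below shows the round trip with a small stand-in dictionary and a temporary file; the dictionary contents and the file name `predictor.joblib` are hypothetical and only serve to demonstrate the pattern.

```python
import os
import tempfile

import joblib

# Hypothetical stand-in for the dictionary of fitted objects.
bundle = {"numeric_cols": ["age", "bmi"], "target_col": "charges"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "predictor.joblib")
    joblib.dump(bundle, path)      # write the objects to disk
    restored = joblib.load(path)   # read them back in a later session

print(restored["target_col"])
```

After loading, each entry (model, imputer, scaler, encoder and the column lists) is available under the same key it was saved with.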