<a id='top'></a>
> ## <div style="text-align: center"> Machine Learning using Python</div>
### Example using Linear Regression Model

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Linear_least_squares_example2.png/220px-Linear_least_squares_example2.png">

Linear regression is used to estimate real values (house prices, number of calls, total sales, etc.) from continuous variable(s). Here, we establish the relationship between the independent and dependent variables by fitting a best-fit line. This line is known as the regression line and is represented by the linear equation *Y = aX + b*.
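
As a small illustration of the equation above, the sketch below fits a line to synthetic data with NumPy. The data, the true slope (2) and the true intercept (1) are assumptions made up for this example and are not part of the Fyntra dataset.

```python
import numpy as np

# Synthetic data generated from a known line, Y = 2 X + 1, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# np.polyfit with deg=1 returns the least-squares slope (a) and intercept (b).
a, b = np.polyfit(x, y, deg=1)
print(f"slope a = {a:.2f}, intercept b = {b:.2f}")
```

The recovered slope and intercept should land close to the values used to generate the data, which is what "fitting a best-fit line" means in practice.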

####  **I hope you find this kernel helpful and some <font color="red"><b>UPVOTES</b></font> would be very much appreciated**

## Notebook  Content

1. [Imports](#1)
1. [Understand the Data](#2)
1. [Exploratory Data Analysis](#3)
1. [Estimator](#4)
1. [Training and Testing Data](#5)
1. [Training the Model](#6)
1. [Predicting Test Data](#7)
1. [Evaluating the Model](#8)
1. [Residuals](#9)
1. [Final Results](#10)

<a id="1"></a> <br>
## 1- Imports
**Import pandas, numpy, matplotlib, and seaborn. Then set %matplotlib inline.**


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

<a id="2"></a> <br>
## 2- Understand the Data

We'll work with FyntraCustomerData.csv. It has customer information, such as Email and Address, along with these numerical columns:

* Avg_Session_Length: Average length of in-store style advice sessions.
* Time_on_App: Average time spent on the app, in minutes.
* Time_on_Website: Average time spent on the website, in minutes.
* Length_of_Membership: How many years the customer has been a member.
* Yearly_Amount_Spent: The output (target) variable, the amount spent in a year in thousand $.


**Read in the FyntraCustomerData.csv file as a DataFrame called customers.**

In [None]:
import numpy as np
import pandas as pd
customers = pd.read_csv("../input/dataset/FyntraCustomerData.csv")

**Check the head of customers, and use its info() and describe() methods to get some insight into the data.**

In [None]:
customers.head()

In [None]:
customers.describe()

In [None]:
customers.info()

**No missing data, which makes the dataset easier to work with.**
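
Besides reading the non-null counts printed by info(), missing values can be counted directly. The sketch below uses a tiny made-up frame (the name `toy` and its values are hypothetical) so that it runs without the CSV; in the notebook the same call would be applied to `customers`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value.
toy = pd.DataFrame({
    "Time_on_App": [12.0, 11.0, np.nan, 13.0],
    "Yearly_Amount_Spent": [500.0, 400.0, 450.0, 600.0],
})

# Count missing values per column; all zeros would mean nothing to impute or drop.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```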

<a id="3"></a> <br>
## 3- Exploratory Data Analysis

**Let's explore the data for a better understanding.**


**Use seaborn to create a jointplot to compare the Time on Website and Yearly Amount Spent columns.**

In [None]:
# Check correlations
# Does more time on the site go with more money spent?
sns.jointplot(x='Time_on_Website',y='Yearly_Amount_Spent',data=customers)

In [None]:
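# numeric_only=True skips text columns such as Email and Address (needs pandas >= 1.5)
correlation = customers.corr(numeric_only=True)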

In [None]:
sns.heatmap(correlation, cmap="YlGnBu")

In [None]:
sns.jointplot(x='Time_on_App',y='Yearly_Amount_Spent',data=customers)
# This correlation looks stronger than the one for Time_on_Website

**Let's explore these types of relationships across the entire data set. Use [pairplot](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html#plotting-pairwise-relationships-with-pairgrid-and-pairplot) to recreate the plot below.**

In [None]:
sns.pairplot(customers)

Based on this plot, which feature looks most correlated with Yearly Amount Spent?
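
One way to answer this numerically rather than by eye is to rank each feature by its Pearson correlation with the target. The sketch below does this on synthetic data; the column names `strong_feature`, `noise_feature` and `target` are hypothetical stand-ins, and in the notebook the target column would be `Yearly_Amount_Spent`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for `customers`: one feature drives the target strongly,
# the other is pure noise.
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({
    "strong_feature": rng.normal(size=n),
    "noise_feature": rng.normal(size=n),
})
toy["target"] = 3.0 * toy["strong_feature"] + rng.normal(scale=0.5, size=n)

# Correlation of every column with the target, largest first.
ranking = toy.corr()["target"].drop("target").sort_values(ascending=False)
print(ranking)
```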

In [None]:
sns.lmplot(x='Length_of_Membership',y='Yearly_Amount_Spent',data=customers)

In [None]:
sns.jointplot(x='Length_of_Membership', y='Yearly_Amount_Spent', data=customers,kind="kde")

<a id="4"></a> <br>
## 4- Estimator
Given a scikit-learn estimator object named **model**, the following methods are available:

#### Available in all Estimators

**model.fit()** : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).

---------------------------------------------------------

#### Available in supervised estimators

**model.predict()** : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.

**model.predict_proba()** : For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().

**model.score()** : for classification or regression problems, most estimators implement a score method, where a larger score indicates a better fit. For classifiers the default score is accuracy, which lies between 0 and 1. For regressors it is R², which is at most 1 and can be negative when the model fits worse than always predicting the mean.

---------------------------------------------------------
#### Available in unsupervised estimators

**model.predict()** : predict labels in clustering algorithms.

**model.transform()** : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.

**model.fit_transform()** : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
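
The methods listed above can be tried out on a small synthetic example. This is only a sketch: the data is made up, and `LinearRegression` and `StandardScaler` were picked here as convenient examples of a supervised estimator and a transformer.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: y depends linearly on two features, with no noise.
rng = np.random.default_rng(2)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 2))
y_demo = 1.5 * X_demo[:, 0] - 0.5 * X_demo[:, 1] + 4.0

# Supervised estimator: fit(X, y), then predict() and score().
model = LinearRegression()
model.fit(X_demo, y_demo)
y_hat = model.predict(X_demo[:5])
r2 = model.score(X_demo, y_demo)  # R^2 for regressors

# Transformer: fit_transform(X) learns the scaling and applies it in one call.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
print(r2, X_scaled.mean(axis=0))
```

Because the synthetic target has no noise, the score comes out at essentially 1, and the scaled columns have a mean of essentially 0.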

<a id="5"></a> <br>
## 5- Training and Testing Data

scikit-learn provides a helpful function for partitioning data, train_test_split, which splits out your data into a training set and a test set.

A common choice is to use 70% of the data for training and 30% for testing.

- Features (X): the numerical features of the customers
- Target (y): the "Yearly_Amount_Spent" column
- Training set: used for fitting the model
- Test set: used for evaluation only

In [None]:
X = customers[['Avg_Session_Length', 'Time_on_App','Time_on_Website', 'Length_of_Membership']]

In [None]:
y = customers['Yearly_Amount_Spent']

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=85)
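
To see what a 70/30 split produces, the sketch below runs train_test_split on a small synthetic array and checks the resulting shapes. The names `X_demo` and `y_demo` and the 100-row size are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 synthetic rows with 4 feature columns and one target value per row.
# Row i holds the values 4*i .. 4*i+3, and its target is i.
X_demo = np.arange(400).reshape(100, 4)
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=85
)
print(X_tr.shape, X_te.shape)

# Rows are shuffled, but features and targets stay paired.
print((X_tr[:, 0] // 4 == y_tr).all())
```

Fixing random_state makes the shuffle reproducible, so the same rows end up in the test set on every run.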

<a id="6"></a> <br>
## 6- Training the Model

Now it's time to train our model on our training data!

**Import LinearRegression from sklearn.linear_model.**

In [None]:
from sklearn.linear_model import LinearRegression

**Create an instance of a LinearRegression() model named lm.**

In [None]:
lm = LinearRegression()

**Train/fit lm on the training data.**

In [None]:
lm.fit(X_train,y_train)

In [None]:
# Print the fitted intercept and coefficients
print('y-intercept             :' , lm.intercept_)
print('beta coefficients       :' , lm.coef_)
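
The intercept and coefficients fully describe the fitted model: a prediction is the intercept plus the dot product of the features with the coefficients. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, `demo_lm` and the generating weights are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with four features and a noisy linear target.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 3.0]) + 10.0 + rng.normal(size=50)

demo_lm = LinearRegression().fit(X_demo, y_demo)

# Rebuild the predictions by hand from intercept_ and coef_.
manual = demo_lm.intercept_ + X_demo @ demo_lm.coef_
matches = np.allclose(manual, demo_lm.predict(X_demo))
print(matches)
```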

<a id="7"></a> <br>
## 7- Predicting Test Data
Now that we have fit our model, let's evaluate its performance by predicting on the test set!

**Use lm.predict() to predict on the X_test set of the data.**

In [None]:
predictions = lm.predict( X_test)

**Create a scatterplot of the real test values versus the predicted values.
Ideally, each point lies on the line y = x, where the predicted value equals the actual value.**

In [None]:
plt.scatter(y_test,predictions)
plt.xlabel('Y Test ')
plt.ylabel('Y Predicted ')

In [None]:
# Here you can compare the test values with the predictions
dft = pd.DataFrame({'Y test': y_test, 'Y Pred':predictions})
dft.head(10)

<a id="8"></a> <br>
## 8- Evaluating the Model

Let's evaluate our model's performance by calculating the following metrics:

- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- R² value

In [None]:
# Calculate these metrics with sklearn.metrics
from sklearn import metrics

print('Mean Abs Error MAE          :', metrics.mean_absolute_error(y_test, predictions))
print('Mean Squared Error MSE      :', metrics.mean_squared_error(y_test, predictions))
print('Root Mean Squared Error RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('r2 value                    :', metrics.r2_score(y_test, predictions))
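
The same four metrics can also be computed by hand, which makes their definitions explicit. The sketch below uses four made-up values (`y_true` and `y_pred` are hypothetical, not results from the model above).

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred                 # 1, 0, -1, -2
mae = np.mean(np.abs(errors))            # mean of absolute errors
mse = np.mean(errors ** 2)               # mean of squared errors
rmse = np.sqrt(mse)                      # back in the units of y
ss_res = np.sum(errors ** 2)             # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot               # share of variance explained

print(mae, mse, rmse, r2)
```

For these values the MAE is 1.0, the MSE is 1.5 and R² is 0.7, which should agree with what the sklearn.metrics functions return for the same inputs.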

<a id="9"></a> <br>
## 9- Residuals

You should have gotten a very good model with a good fit. Let's quickly explore the residuals to make sure everything was okay with our data. 

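**Plot a histogram of the residuals and make sure it looks normally distributed. Use either seaborn histplot, or just plt.hist().**

In [None]:
# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(y_test - predictions, bins=50, kde=True);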
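
A histogram is a visual check; a quick numeric complement is to look at the mean and spread of the residuals. The sketch below fits an ordinary least-squares line to synthetic data with NumPy and inspects the residuals on the data it was fit to, where a model with an intercept has residuals that average to zero. On held-out test data, as in the notebook, the mean is only expected to be close to zero. All names and values here are made up for illustration.

```python
import numpy as np

# Synthetic data: a known line plus Gaussian noise with standard deviation 1.
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=200)

# Design matrix with a column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

print(f"mean = {residuals.mean():.2e}, std = {residuals.std():.2f}")
```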

<a id="10"></a> <br>
## 10- Final Results

**Based on the coefficients, decide whether the company should focus more on its mobile app or on its website.**

In [None]:
# One row per feature, with its fitted coefficient
coefficients = pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient'])
coefficients

Interpreting the coefficients:

- Holding all other features fixed, a 1 unit increase in **Avg. Session Length** is associated with an **increase of 26.08 total dollars spent**.
- Holding all other features fixed, a 1 unit increase in **Time on App** is associated with an **increase of 39.18 total dollars spent**.
- Holding all other features fixed, a 1 unit increase in **Time on Website** is associated with an **increase of 0.40 total dollars spent**.
- Holding all other features fixed, a 1 unit increase in **Length of Membership** is associated with an **increase of 61.41 total dollars spent**.


**Time on App is associated with a much larger increase in spending than Time on Website, so the company should focus on the app rather than the website. Whether the company should shut down the website depends solely on management's view of the revenue that comes through it.**

## For more Machine Learning Algorithms check this kernel 

## [LINK](https://www.kaggle.com/marcovasquez/top-machine-learning-algorithms)



<a href="#top" class="btn btn-primary btn-lg active" role="button" aria-pressed="true">Go to TOP</a>