# Ecommerce Customer Spend Prediction with Linear Regression:

- ## Objective:
The objective of using linear regression for this dataset is to predict the Yearly Amount Spent by customers based on various customer features such as Time on App, Time on Website, and Length of Membership. By applying linear regression, the goal is to establish a linear relationship between these independent variables (customer behaviors) and the dependent variable (the amount spent annually). This model can then be used to make predictions on future customer spending, assisting in business decisions and customer segmentation strategies.
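
Written out, the linear relationship described above has the form below. The symbols are notation assumed here for illustration: $x_1, \dots, x_n$ stand for the customer features, $\hat{y}$ for the predicted Yearly Amount Spent, $\theta_0$ for the intercept, and $\theta_1, \dots, \theta_n$ for the coefficients the model learns.

```latex
\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n
```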


- ## About the dataset:
This dataset contains customer data from an ecommerce platform. (2022-01-19)
It can be used to predict the Yearly Amount Spent based on customer features.

Columns:

-Email

-Address

-Avatar

-Avg. Session Length

-Time on App

-Time on Website

-Length of Membership

-Yearly Amount Spent

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

### import libraries:

In [4]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
%matplotlib inline
plt.rcParams["figure.figsize"] = (10,6)
# %matplotlib inline renders plots inside the notebook; rcParams sets the default figure size.

### Load the dataset:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [2]:
df = pd.read_csv("/home/srujan-panchajanya-s-s/BGS GEN AI/DAY 1/Ecommerce Customers.csv")
df.head()

| | Email | Address | Avatar | Avg. Session Length | Time on App | Time on Website | Length of Membership | Yearly Amount Spent |
|---|---|---|---|---|---|---|---|---|
| 0 | mstephenson@fernandez.com | 835 Frank Tunnel\nWrightmouth, MI 82180-9605 | Violet | 34.497268 | 12.655651 | 39.577668 | 4.082621 | 587.951054 |
| 1 | hduke@hotmail.com | 4547 Archer Common\nDiazchester, CA 06566-8576 | DarkGreen | 31.926272 | 11.109461 | 37.268959 | 2.664034 | 392.204933 |
| 2 | pallen@yahoo.com | 24645 Valerie Unions Suite 582\nCobbborough, D... | Bisque | 33.000915 | 11.330278 | 37.110597 | 4.104543 | 487.547505 |
| 3 | riverarebecca@gmail.com | 1414 David Throughway\nPort Jason, OH 22070-1220 | SaddleBrown | 34.305557 | 13.717514 | 36.721283 | 3.120179 | 581.852344 |
| 4 | mstephens@davidson-herman.com | 14023 Rodriguez Passage\nPort Jacobville, PR 3... | MediumAquaMarine | 33.330673 | 12.795189 | 37.536653 | 4.446308 | 599.406092 |


## Explore the dataset

In [3]:
df.info()
df.shape

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 8 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Email                 500 non-null    object 
 1   Address               500 non-null    object 
 2   Avatar                500 non-null    object 
 3   Avg. Session Length   500 non-null    float64
 4   Time on App           500 non-null    float64
 5   Time on Website       500 non-null    float64
 6   Length of Membership  500 non-null    float64
 7   Yearly Amount Spent   500 non-null    float64
dtypes: float64(5), object(3)
memory usage: 31.4+ KB


(500, 8)

In [None]:
df.columns

In [None]:
df.dtypes

In [None]:
df.nunique()

In [None]:
df.describe().T

- The Yearly Amount Spent has a relatively large spread (standard deviation of 79.31), indicating significant variation in how much customers spend annually.

- Time on Website values are higher than Time on App values (roughly 37 versus 12 in the rows shown by df.head()), which suggests that customers generally spend more time on the website than on the app. How tightly each column is spread is a separate question, answered by the std row of the summary.

- The Length of Membership ranges from customers who have been members for less than a year (0.27 years) to those who have been members for almost 7 years (6.92 years).
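
The observations above come from reading the mean, std, min, and max rows of the summary. A minimal sketch of pulling out just those statistics follows; it runs on synthetic stand-in data (not the real dataset), and the names demo and spread are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real dataset); only the column names
# are copied from the notebook.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Time on App": rng.normal(12, 1, 200),
    "Time on Website": rng.normal(37, 1, 200),
    "Length of Membership": rng.normal(3.5, 1, 200),
})

# describe().T gives one row per column and one column per statistic.
summary = demo.describe().T

# Keep the statistics discussed above and add the range (max - min).
spread = summary[["mean", "std", "min", "max"]].copy()
spread["range"] = spread["max"] - spread["min"]
print(spread.round(2))
```

Comparing the mean column answers "where do customers spend more time", while std and range answer "how much do customers differ from one another".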

In [None]:
# Are there any missing values?
if df.isnull().sum().any():
    print("yes")
else:
    print("No")

In [None]:
# Are there any duplicated rows?
if df.duplicated().any():
    print("yes")
else:
    print("No")
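
The yes/no checks above do not say where a problem is. As a sketch, the same pandas calls can report a count per column and a count of duplicated rows. The tiny frame below is made up for illustration and deliberately contains one missing value and one repeated row.

```python
import numpy as np
import pandas as pd

# Made-up example frame: one NaN and one exact duplicate row (rows 0 and 3).
demo = pd.DataFrame({
    "Time on App": [12.1, np.nan, 11.3, 12.1],
    "Time on Website": [37.0, 36.5, 38.2, 37.0],
})

# Number of missing values in each column.
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(demo.duplicated().sum())
print("duplicated rows:", n_duplicates)
```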

### data visualization

In [None]:
# Create a jointplot of Time on Website and Yearly Amount Spent: a scatter plot with histogram marginal distributions
sns.jointplot(x="Time on Website", y="Yearly Amount Spent", data=df)

**Now let's see Time on App & Yearly Amount Spent**

In [None]:
# Create a jointplot of Time on App and Yearly Amount Spent
sns.jointplot(x="Time on App", y="Yearly Amount Spent", data=df)

**Use jointplot to create a 2D hex bin plot comparing Time on App and Length of Membership**

In [None]:
sns.jointplot(x="Time on App", y="Length of Membership", data=df, kind='hex')

**Let's explore these types of relationships across the entire data set. Use [pairplot](https://stanford.edu/~mwaskom/software/seaborn/tutorial/axis_grids.html#plotting-pairwise-relationships-with-pairgrid-and-pairplot) to recreate the plot below. (Don't worry about the colors.)**

In [None]:
sns.pairplot(data=df)

In [None]:
numeric_df = df.select_dtypes(include=['number'])
sns.heatmap(numeric_df.corr(), annot=True, vmin=-1, vmax=1, cmap='coolwarm')

Based on this plot, *Length of Membership* looks to be the feature most correlated with Yearly Amount Spent.
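
Reading the strongest correlation off a heatmap can also be done numerically by sorting the target's column of the correlation matrix. The sketch below uses synthetic data whose target is built with made-up weights so that membership dominates; demo and target_corr are hypothetical names, and the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "Time on App": rng.normal(12, 1, n),
    "Time on Website": rng.normal(37, 1, n),
    "Length of Membership": rng.normal(3.5, 1, n),
})
# Hypothetical target: the weights 60 and 30 are invented for this example.
demo["Yearly Amount Spent"] = (
    60 * demo["Length of Membership"]
    + 30 * demo["Time on App"]
    + rng.normal(0, 10, n)
)

# Correlation of every feature with the target, strongest first.
target_corr = (
    demo.corr()["Yearly Amount Spent"]
    .drop("Yearly Amount Spent")
    .sort_values(ascending=False)
)
print(target_corr)
```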

In [None]:
sns.lmplot(x="Length of Membership", y="Yearly Amount Spent", data=df, line_kws={"color": "pink"})
plt.title("Yearly Amount Spent vs. Length of Membership")
plt.xlabel("Length of Membership")
plt.ylabel("Yearly Amount Spent")

# **Training and Testing Data**

**Set a variable X equal to the numerical features of the customers, which are 'Avg. Session Length', 'Time on App', 'Time on Website', and 'Length of Membership', and a variable y equal to the "Yearly Amount Spent" column.**

In [None]:
X = df[['Avg. Session Length','Time on App', 'Time on Website','Length of Membership']]
y = df["Yearly Amount Spent"]


In [None]:
# Split the data into training and testing sets with test_size=0.3 and random_state=101
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
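
To make the effect of test_size=0.3 concrete, the sketch below splits a placeholder array with 500 rows (the row count reported by df.info()) and prints the resulting shapes. X_demo and y_demo are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the dataset (500).
X_demo = np.arange(500 * 4).reshape(500, 4)
y_demo = np.arange(500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
print(X_tr.shape, X_te.shape)  # (350, 4) (150, 4)

# Fixing random_state makes the split reproducible across runs.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101
)
print(np.array_equal(X_te, X_te2))  # True
```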


# **Training the Model**



In [None]:
# Instantiate the LinearRegression class as a model named lm.
lm = LinearRegression()

**Train/fit lm on the training data.**

In [None]:
lm.fit(X_train, y_train)
# fit() trains the linear regression model lm on the training data:
# the input features (X_train) and the target values (y_train).

In [None]:
#Print out the coefficients of the model
print('Coefficients: \n', lm.coef_)  # prints theta 1, theta 2, theta 3, and theta 4
print('intercept_y:\n', lm.intercept_)  # prints theta 0
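
The intercept and coefficients fully determine the model's output: a prediction is the intercept plus the dot product of the feature row with the coefficient vector. A minimal sketch on synthetic data (the true_theta weights and the intercept of 10 are made up) checks this against predict():

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with four features and made-up true parameters.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 4))
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
y_demo = 10.0 + X_demo @ true_theta + rng.normal(0, 0.1, 100)

model = LinearRegression().fit(X_demo, y_demo)

# Prediction by hand: theta 0 + theta 1 * x1 + ... + theta 4 * x4.
manual = model.intercept_ + X_demo @ model.coef_

# The manual computation matches what predict() returns.
print(np.allclose(manual, model.predict(X_demo)))  # True
```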

## **Predicting Test Data**
**Now that we have fit our model, let's evaluate its performance by predicting off the test values!**

In [None]:
# Use lm.predict() to predict off the X_test set of the data.
predictions = lm.predict(X_test)

In [None]:
# Create a scatter plot of the predicted vs. actual values
plt.scatter(predictions, y_test)

# Add axis labels
plt.xlabel('Predicted Yearly Amount Spent')
plt.ylabel('Actual Yearly Amount Spent')

# Show the plot
plt.show()

## **Evaluating the Model**

**Calculate the R-squared score, the Mean Absolute Error, the Mean Squared Error, and the Root Mean Squared Error.**

In [None]:
# R squared
R2_score = r2_score(y_test, predictions)
print ('R2_score is', R2_score)

#Mean absolute error
MAE = mean_absolute_error(y_test, predictions)
print('Mean absolute error (MAE):', MAE)

#Mean squared error
MSE= mean_squared_error(y_test,predictions)
print('Mean squared error (MSE):', MSE)

#Root mean squared error
RMSE=  np.sqrt(mean_squared_error(y_test, predictions))
print('Root mean squared error (RMSE):', RMSE)

**Low RMSE and MAE values relative to the scale of Yearly Amount Spent indicate that the model's predictions are close to the actual values. Because these metrics are computed on the held-out test set, they also show how well the model generalizes to new data; comparing them with the same metrics on the training set is the way to check for overfitting. MSE is the mean of the squared differences between the predicted and actual values, and RMSE is its square root, which puts the error back in the units of the target. A lower RMSE or MSE indicates that the model is making more accurate predictions.**
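
As a worked example of how these metrics relate, the sketch below computes each one by hand with NumPy on three made-up values and compares the results with the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Three made-up actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred                 # [ 1.,  0., -2.]
mae = np.mean(np.abs(errors))            # (1 + 0 + 2) / 3 = 1.0
mse = np.mean(errors ** 2)               # (1 + 0 + 4) / 3 = 1.666...
rmse = np.sqrt(mse)                      # square root of the MSE
# R squared: 1 - (residual sum of squares / total sum of squares) = 1 - 5/8
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)

# The hand-computed values agree with scikit-learn.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

Squaring penalizes the single error of 2 more heavily than the error of 1, which is why RMSE (about 1.29) comes out larger than MAE (1.0) here.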

# **Residuals**

You should have gotten a very good model with a good fit. Let's quickly explore the residuals to make sure everything was okay with our data.

**Plot a histogram of the residuals and make sure it looks normally distributed. Use either seaborn histplot (distplot is deprecated in recent seaborn releases), or just plt.hist().**

In [None]:
# Calculate the residuals
residuals = y_test - predictions

# Plot a histogram of the residuals with a KDE overlay
# (sns.distplot is deprecated; histplot is its replacement)
sns.histplot(residuals, kde=True)

# Show the plot
plt.show()
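
A histogram can be complemented with a few summary numbers. The sketch below fits a model on synthetic data with Gaussian noise (all values made up) and reports the mean, standard deviation, and skewness of the residuals. It uses residuals on the data the model was fit on, where least squares with an intercept forces the mean to zero; on a test set, as in the cell above, the mean is only expected to be close to zero.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear signal plus Gaussian noise with standard deviation 1.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(400, 2))
y_demo = 5.0 + X_demo @ np.array([1.5, -2.0]) + rng.normal(0, 1.0, 400)

model = LinearRegression().fit(X_demo, y_demo)
resid = pd.Series(y_demo - model.predict(X_demo))

# Roughly normal residuals have a mean near 0 and skewness near 0.
print("mean:", resid.mean())
print("std: ", resid.std())
print("skew:", resid.skew())
```

A skewness far from zero or a clearly lopsided histogram would suggest that a linear model is missing some structure in the data.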

## **Conclusion**
**We still want to figure out the answer to the original question: do we focus our efforts on mobile app or website development? Or maybe that doesn't even really matter, and Length of Membership is what is really important. Let's see if we can interpret the coefficients to get an idea.**

 **Recreate the dataframe below.**

In [None]:
coefs = lm.coef_

# Create a dataframe with the coefficients
# (named coef_df so the original df is not overwritten)
coef_df = pd.DataFrame(data=coefs, columns=['Coefficient'], index=X_train.columns)

# Display the dataframe
coef_df

In [None]:
# Visualize the model results
plt.scatter(predictions, y_test)
plt.xlabel('Predicted Yearly Amount Spent')
plt.ylabel('Actual Yearly Amount Spent')
plt.show()


# **How can you interpret these coefficients?**

**Holding all other features fixed:**

- **A 1 unit increase in Avg. Session Length is associated with an increase of 25.98 total dollars spent.**
- **A 1 unit increase in Time on App is associated with an increase of 38.59 total dollars spent.**
- **A 1 unit increase in Time on Website is associated with an increase of 0.19 total dollars spent.**
- **A 1 unit increase in Length of Membership is associated with an increase of 61.27 total dollars spent.**
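
The same reading applies to changes other than one unit: the predicted change in spending is each coefficient multiplied by the change in its feature, summed over the features that change. A small sketch using the rounded coefficients quoted above follows; the helper predicted_change is a hypothetical name introduced for illustration.

```python
# Rounded coefficients as quoted in the interpretation above.
coefficients = {
    "Avg. Session Length": 25.98,
    "Time on App": 38.59,
    "Time on Website": 0.19,
    "Length of Membership": 61.27,
}


def predicted_change(feature_changes):
    """Predicted change in Yearly Amount Spent for the given feature changes,
    holding every feature not listed fixed."""
    return sum(coefficients[name] * delta for name, delta in feature_changes.items())


# Half a unit more Time on App: 38.59 * 0.5
app_change = predicted_change({"Time on App": 0.5})
print(round(app_change, 3))  # 19.295

# One more unit of membership and one more unit of Time on Website together.
combined_change = predicted_change({"Length of Membership": 1, "Time on Website": 1})
print(round(combined_change, 2))  # 61.46
```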

# **Do you think the company should focus more on their mobile app or on their website?**

**This is tricky. There are two ways to think about it: develop the website to catch up to the performance of the mobile app, or develop the app further since that is what is working better. The answer really depends on the other factors going on at the company, and you would probably want to explore the relationship between Length of Membership and the app or the website before coming to a conclusion!**