# **Case study on Python Flask**

**Import necessary libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

**Reading data**

In [2]:
data=pd.read_csv("/content/Social_Network_Ads.csv")
data.head()

| | User ID | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|---|
| 0 | 15624510 | Male | 19 | 19000 | 0 |
| 1 | 15810944 | Male | 35 | 20000 | 0 |
| 2 | 15668575 | Female | 26 | 43000 | 0 |
| 3 | 15603246 | Female | 27 | 57000 | 0 |
| 4 | 15804002 | Male | 19 | 76000 | 0 |


**Explore data**

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          400 non-null    int64 
 1   Gender           400 non-null    object
 2   Age              400 non-null    int64 
 3   EstimatedSalary  400 non-null    int64 
 4   Purchased        400 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 15.8+ KB


In [4]:
# shape: number of rows and columns

data.shape

(400, 5)

In [5]:
#Checking for null values

data.isna().sum()

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

In [6]:
data.describe()

| | User ID | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|
| count | 400.0 | 400.0 | 400.0 | 400.0 |
| mean | 15691540.0 | 37.655 | 69742.5 | 0.3575 |
| std | 71658.32 | 10.482877 | 34096.960282 | 0.479864 |
| min | 15566690.0 | 18.0 | 15000.0 | 0.0 |
| 25% | 15626760.0 | 29.75 | 43000.0 | 0.0 |
| 50% | 15694340.0 | 37.0 | 70000.0 | 0.0 |
| 75% | 15750360.0 | 46.0 | 88000.0 | 1.0 |
| max | 15815240.0 | 60.0 | 150000.0 | 1.0 |


In [7]:
# to get the number of unique elements in each feature of the dataset
data.nunique()

User ID            400
Gender               2
Age                 43
EstimatedSalary    117
Purchased            2
dtype: int64

**Preprocessing the data**

In [8]:
# Drop unnecessary columns from the dataset

data.drop('User ID',axis=1,inplace=True)

There are no missing values, so no missing-value handling is needed and we can move on to encoding Gender.

**Encoding**

In [9]:
#Label encoding for the categorical variable Gender

from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()

data['Gender']=le.fit_transform(data['Gender'])
data.head()

| | Gender | Age | EstimatedSalary | Purchased |
|---|---|---|---|---|
| 0 | 1 | 19 | 19000 | 0 |
| 1 | 1 | 35 | 20000 | 0 |
| 2 | 0 | 26 | 43000 | 0 |
| 3 | 0 | 27 | 57000 | 0 |
| 4 | 1 | 19 | 76000 | 0 |
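
`LabelEncoder` assigns integer codes in sorted order of the class labels, which is why `Female` becomes 0 and `Male` becomes 1 in the table above. A minimal sketch on a toy list (the sample values are illustrative, not read from the CSV file) shows one way to inspect and reverse the mapping:

```python
from sklearn.preprocessing import LabelEncoder

# Toy sample of the Gender column (illustrative values only)
genders = ["Male", "Male", "Female", "Female", "Male"]

le = LabelEncoder()
codes = le.fit_transform(genders)

# classes_ holds the labels in sorted order; each label's index is its code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)                                  # {'Female': 0, 'Male': 1}
print(codes.tolist())                           # [1, 1, 0, 0, 1]
print(le.inverse_transform([0, 1]).tolist())    # ['Female', 'Male']
```

Keeping the fitted encoder (or the `mapping` dictionary) around is useful later, because any new input has to be encoded the same way before it is passed to a model.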


**Declare feature vector and target variable**

In [10]:
y=data['Purchased']
X=data.drop('Purchased',axis=1)


**Split data into separate training and test set**

In [11]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=42,test_size=0.20)


In [12]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape,y_train.shape,y_test.shape

((320, 3), (80, 3), (320,), (80,))
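
The split above is random and not stratified. Since `data.describe()` reports a mean of 0.3575 for `Purchased`, one option is to pass `stratify=y` so that both subsets keep roughly that class ratio. The sketch below uses a synthetic target with the same ratio (143 of 400 positives, a count derived from that mean) instead of the real file, so the names `X_demo` and `y_demo` are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 400 rows, 143 positives (0.3575 * 400)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(400, 3))
y_demo = np.array([1] * 143 + [0] * 257)

# stratify=y_demo keeps the class ratio close to 0.3575 in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(round(float(y_tr.mean()), 3), round(float(y_te.mean()), 3))
```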

This is a classification problem, since the target variable Purchased indicates whether a user made a purchase or not.

Common classification algorithms for this type of problem include Logistic Regression, Decision Tree, Random Forest, Support Vector Machines (SVM), and k-nearest neighbors (KNN).

I am going to build logistic regression, decision tree, random forest and SVM models and keep the one with the highest accuracy.
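
The comparison described above can be sketched as a loop over the four estimators, with a fixed `random_state` so that the tree-based scores are repeatable between runs. The data here is synthetic (`make_classification`), and the names `models` and `scores` are chosen for illustration, so the printed numbers will not match the ones in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 3 features, standing in for Gender, Age, EstimatedSalary
X_syn, y_syn = make_classification(
    n_samples=400, n_features=3, n_informative=2, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.20, random_state=42
)

models = {
    "logistic regression": LogisticRegression(),
    "svm": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(random_state=42),
}

# Fit each model on the same split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

best = max(scores, key=scores.get)
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
print("best:", best)
```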

**Logistic Regression**

In [13]:
from sklearn.linear_model import LogisticRegression
log_reg=LogisticRegression()

log_reg.fit(X_train,y_train)
log_pred=log_reg.predict(X_test)
log_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [14]:
from sklearn.metrics import confusion_matrix,accuracy_score,precision_score,recall_score,f1_score


In [15]:
print('Accuracy:',accuracy_score(y_test,log_pred))
print('Precision:',precision_score(y_test,log_pred))
print('recall:',recall_score(y_test,log_pred))
print('F1:',f1_score(y_test,log_pred))

Accuracy: 0.65
Precision: 0.0
recall: 0.0
F1: 0.0
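
These four numbers follow directly from the confusion matrix. Because every prediction is 0, there are no true or false positives; an accuracy of 0.65 on 80 test rows corresponds to 52 correct negatives and 28 missed positives (counts inferred from the printed scores, not read from the notebook). A small worked sketch, with a hypothetical helper `scores_from_counts`:

```python
def scores_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0, which is the value
    scikit-learn falls back to (with a warning) by default.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# All 80 test rows predicted as 0: 52 true negatives, 28 false negatives
print(scores_from_counts(tp=0, fp=0, fn=28, tn=52))   # (0.65, 0.0, 0.0, 0.0)
```

In other words, 0.65 is the score of a classifier that always answers "no purchase", so it says nothing about what the model has learned.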




An accuracy of 0.65 with zero precision and recall means the model predicted class 0 for every test row, so we try scaling the features.

In [16]:
from sklearn.preprocessing import StandardScaler
Scaler=StandardScaler()

In [17]:
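Scaled_Xtrain=Scaler.fit_transform(X_train)
# Caution: this refits the scaler on the test set; Scaler.transform(X_test)
# would reuse the training statistics and may change the scores below
Scaled_Xtest=Scaler.fit_transform(X_test)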
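
A common convention is to learn the scaling statistics from the training split only and reuse them on the test split, so that test rows are scaled with the same mean and standard deviation as the data the model was trained on. A minimal sketch with made-up numbers (the arrays below are illustrative, not rows from the dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative Age / EstimatedSalary values, not taken from the CSV
X_train_demo = np.array(
    [[19, 19000], [35, 20000], [26, 43000], [27, 57000]], dtype=float
)
X_test_demo = np.array([[19, 76000], [40, 30000]], dtype=float)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learns mean_ and scale_ from train
X_test_scaled = scaler.transform(X_test_demo)        # reuses the training statistics

# The scaler's parameters come from the training rows only
print(scaler.mean_)
print(X_test_scaled)
```

With this pattern the scaled test columns will generally not have mean 0 and unit variance, which is expected: they are measured against the training distribution.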

Now we use the scaled train and test data to build the model.

In [18]:
log_reg.fit(Scaled_Xtrain,y_train)
log_pred=log_reg.predict(Scaled_Xtest)
log_pred

array([0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0])

In [19]:
print('Accuracy:',accuracy_score(y_test,log_pred))
print('Precision:',precision_score(y_test,log_pred))
print('recall:',recall_score(y_test,log_pred))
print('F1:',f1_score(y_test,log_pred))

Accuracy: 0.875
Precision: 0.875
recall: 0.75
F1: 0.8076923076923077


**SVM**

In [20]:
from sklearn.svm import SVC
sv_clf=SVC(kernel='linear')       # other kernel options: 'poly', 'rbf', 'sigmoid', 'precomputed'
sv_clf.fit(X_train,y_train)
svm_pred=sv_clf.predict(X_test)

In [21]:
print('Accuracy:',accuracy_score(y_test,svm_pred))
print('Precision:',precision_score(y_test,svm_pred))
print('recall:',recall_score(y_test,svm_pred))
print('F1:',f1_score(y_test,svm_pred))

Accuracy: 0.85
Precision: 0.8333333333333334
recall: 0.7142857142857143
F1: 0.7692307692307692


**Decision Tree**

In [22]:
from sklearn.tree import DecisionTreeClassifier
dt_clf=DecisionTreeClassifier()
dt_clf.fit(X_train,y_train)
dt_pred= dt_clf.predict(X_test)

In [23]:
print('Accuracy:',accuracy_score(y_test,dt_pred))
print('Precision:',precision_score(y_test,dt_pred))
print('recall:',recall_score(y_test,dt_pred))
print('F1:',f1_score(y_test,dt_pred))

Accuracy: 0.8375
Precision: 0.7777777777777778
recall: 0.75
F1: 0.7636363636363638


**Random Forest**

In [24]:
from sklearn.ensemble import RandomForestClassifier
rf_clf=RandomForestClassifier()
rf_clf.fit(X_train,y_train)
rf_pred=rf_clf.predict(X_test)

In [25]:
print('Accuracy:',accuracy_score(y_test,rf_pred))
print('Precision:',precision_score(y_test,rf_pred))
print('recall:',recall_score(y_test,rf_pred))
print('F1:',f1_score(y_test,rf_pred))

Accuracy: 0.9
Precision: 0.8333333333333334
recall: 0.8928571428571429
F1: 0.8620689655172413


In [26]:
# rf_clf.predict([[1,20,95000]])

**Results and Conclusion:**

I built classifier models to predict whether a customer made a purchase or not. Four models were compared: logistic regression, SVM, decision tree and random forest.

Of these, the random forest model had the highest accuracy, 0.9.

So, I am going to pickle my model.

In [27]:
import pickle
# Serialize the fitted random forest; the with-block closes the file handle
with open('model.pkl','wb') as f:
    pickle.dump(rf_clf,f)


In [28]:
with open('model.pkl','rb') as f:
    model=pickle.load(f)



In [29]:
model.predict([[1,20,95000]])



array([1])
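
The list passed to `predict` has to follow the training column order: `Gender` (1 = Male after label encoding), `Age`, `EstimatedSalary`. One way to make that explicit is to wrap the input in a DataFrame with the same column names. The sketch below trains a small stand-in forest on made-up rows so that it runs on its own; the helper name `predict_purchase` and the constant `FEATURES` are hypothetical.

```python
import pickle

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["Gender", "Age", "EstimatedSalary"]

# Made-up training rows, only so the example is self-contained
train = pd.DataFrame(
    {
        "Gender": [1, 1, 0, 0, 1, 0],
        "Age": [19, 35, 26, 47, 52, 58],
        "EstimatedSalary": [19000, 20000, 43000, 120000, 130000, 140000],
        "Purchased": [0, 0, 0, 1, 1, 1],
    }
)
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(train[FEATURES], train["Purchased"])

# Round trip through pickle in memory, mirroring dump/load on model.pkl
restored = pickle.loads(pickle.dumps(clf))

def predict_purchase(model, gender, age, salary):
    """Return the predicted class (0 or 1) for one user."""
    row = pd.DataFrame([[gender, age, salary]], columns=FEATURES)
    return int(model.predict(row)[0])

print(predict_purchase(restored, gender=1, age=20, salary=95000))
```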