##**Classification Project using Social Network Ads Data**

**1) Install/ Import the required Python Packages/ Libraries**

In [3]:
#Import required python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import warnings
warnings.filterwarnings("ignore")
from sklearn import preprocessing
%matplotlib inline

**2) Mounting the Google Drive**

In [4]:
# Mount the Google Drive
#from google.colab import drive
#drive.mount('/content/gdrive')

**3) Read the Data file and check**

In [5]:
# Read the Social Network Ads data from the .csv file and check the data shape (number of Rows and Columns)
#df = pd.read_csv('gdrive/My Drive/NCJ-MLP-Training-2022/NCJ-MLP-Projects-Latest/06-Bank-Loan-Project/Data-Files/Train_Loan_Status.csv')
#df = pd.read_csv('gdrive/My Drive/Students-ML-Training-2023/Demo-Projects/02-Customers-Purchase-Prediction/Data-Files/Training_Data_for_Purchase_Prediction.csv')
df = pd.read_csv('D:/Muthu-Office/Indi-ML-Projects-2023/03-Ananconda-Projects/11_Customers_Purchase_Prediction/Data-Files/Train_Social_Network_Ads_Data.csv')
print(df.shape)
df.head()

(350, 5)


Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
0,15624510,Male,19,19000,0
1,15810944,Male,35,20000,0
2,15668575,Female,26,43000,0
3,15603246,Female,27,57000,0
4,15804002,Male,19,76000,0


##**I) Check and decide the ML Learning Type and sub-type as applicable**

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   User ID          350 non-null    int64 
 1   Gender           350 non-null    object
 2   Age              350 non-null    int64 
 3   EstimatedSalary  350 non-null    int64 
 4   Purchased        350 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 13.8+ KB


**Observations on the given Dataset:**
* a) Number of Independent Variables: 4 (User ID, Gender, Age, EstimatedSalary)
* b) Number of Dependent Variables: 1 (Purchased)
* c) There is no Missing Value in the Dependent Variable column "Purchased"


**Conclusions:**
###**a) The given dataset probably belongs to the "Supervised Learning" main-type**
###**b) Since the Dependent variable values are categorical in nature, the given dataset is of "Classification" sub-type.**
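
One way to support this conclusion is to count the distinct values in the target column. The sketch below uses a small hypothetical frame and a hypothetical helper `guess_task_type`; the threshold of 10 distinct values is an assumption for illustration, not a fixed rule.

```python
import pandas as pd


def guess_task_type(target: pd.Series, max_classes: int = 10) -> str:
    """Heuristic: a target with few distinct values suggests classification."""
    n_unique = target.nunique()
    if n_unique <= max_classes:
        return "classification"
    return "regression"


# Hypothetical sample that mimics the 0/1 "Purchased" column
sample = pd.DataFrame({"Purchased": [0, 0, 1, 0, 1, 1, 0]})
task = guess_task_type(sample["Purchased"])
print(task)
```

A heuristic like this only hints at the sub-type; the meaning of the column still has to be checked by hand.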

##**II) Check and remove the duplicate records, if any**

In [7]:
df.shape

(350, 5)

In [8]:
# Returns True for every row that is a duplicate, otherwise False:
print(df.duplicated())

0      False
1      False
2      False
3      False
4      False
       ...  
345    False
346    False
347    False
348    False
349    False
Length: 350, dtype: bool


In [9]:
# Remove all duplicates:
df.drop_duplicates(inplace = True)

In [10]:
df.shape

(350, 5)

###**Conclusion: No Duplicate Records**
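
Printing the full boolean Series truncates the middle rows, so a duplicate could hide in the elided part. A more compact check, sketched here on a hypothetical toy frame, is to sum the boolean mask so the number of duplicates is shown directly.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one
toy = pd.DataFrame({
    "Age": [19, 35, 19, 26],
    "EstimatedSalary": [19000, 20000, 19000, 43000],
})

# duplicated() marks every repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()
print(n_duplicates, deduplicated.shape)
```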

##**III) Check the Class balance**

In [11]:
df["Purchased"].value_counts()

Purchased
0    243
1    107
Name: count, dtype: int64

###**Conclusion: It is a Binary Classification with imbalanced Classes**
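
The degree of imbalance can be expressed as class proportions. The sketch below rebuilds a Series with the same counts as above (243 zeros, 107 ones); the variable names are illustrative.

```python
import pandas as pd

# Same class counts as the value_counts() output above
purchased = pd.Series([0] * 243 + [1] * 107, name="Purchased")

# normalize=True returns shares instead of raw counts
proportions = purchased.value_counts(normalize=True)
imbalance_ratio = proportions.max() / proportions.min()
print(proportions.round(3))
print(round(imbalance_ratio, 2))
```

Roughly 69% of the rows belong to class 0, so plain accuracy should be read together with per-class precision and recall.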

##**IV) Check for Missing Values and handle them as required**

**a) Check the Missing Values, if any**

In [12]:
df.isnull().sum()

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

**b) List the rows having missing Values, if any**

In [13]:
df[df.isnull().any(axis=1)]

Unnamed: 0,User ID,Gender,Age,EstimatedSalary,Purchased
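
The empty result above means no row has a missing value. To get the count as a single number instead of a listing, the row mask can be summed, as in this sketch on a hypothetical toy frame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two incomplete rows
toy = pd.DataFrame({
    "Age": [19, np.nan, 26],
    "EstimatedSalary": [19000, 20000, np.nan],
})

# any(axis=1) flags a row if at least one of its columns is null
rows_with_missing = int(toy.isnull().any(axis=1).sum())
print(rows_with_missing)
```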


##**VI) Check the unique Values of each column and observe the following and take actions as required:**
* **a) Wrong Data in the columns, if any**
* **b) Wrong format of the data in the columns, if any**
* **c) Identify the columns which need to be categorically converted to numeric values by using Nominal method/ Ordinal Method**


###**Column-1: User ID**

In [14]:
df['User ID'].value_counts()

User ID
15624510    1
15794253    1
15617877    1
15753874    1
15660541    1
           ..
15689237    1
15739160    1
15773447    1
15619653    1
15721835    1
Name: count, Length: 350, dtype: int64

**Observations:**
* a) Every value in this column is unique (350 distinct IDs), so it will not contribute to the prediction of the Dependent variable

**Decision:**

**We will be dropping this column**

**Action:**

In [15]:
df.drop(['User ID'], axis = 1, inplace = True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Gender           350 non-null    object
 1   Age              350 non-null    int64 
 2   EstimatedSalary  350 non-null    int64 
 3   Purchased        350 non-null    int64 
dtypes: int64(3), object(1)
memory usage: 11.1+ KB


###**Column-2: Gender**

In [16]:
df['Gender'].value_counts()

Gender
Female    178
Male      172
Name: count, dtype: int64

**Observations:**
* a) Data in this column is of "Object" or "String" datatype. Also, the data levels are "Nominal" Type.

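**Decision:**

**We will be converting the data in this column into Numerical values using "LabelEncoder". Since the column has only two levels, the resulting 0/1 codes match what the Nominal Type method "pd.get_dummies" with drop_first=True would produce.**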

**Action:**

In [17]:
le = preprocessing.LabelEncoder()
df['Gender'] = le.fit_transform(df.Gender.values)
df['Gender'].value_counts()

Gender
0    178
1    172
Name: count, dtype: int64
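
To see which label received which code, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a hypothetical four-row Series and also compares the codes with `pd.get_dummies(..., drop_first=True)`.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical sample of the Gender column
gender = pd.Series(["Male", "Female", "Female", "Male"], name="Gender")

le = preprocessing.LabelEncoder()
codes = le.fit_transform(gender)

# classes_ is sorted, so the position of each label is its integer code
mapping = {label: int(code) for code, label in enumerate(le.classes_)}

# For a two-level column, dropping the first dummy leaves one 0/1 column
dummies = pd.get_dummies(gender, drop_first=True).astype(int)
print(mapping)
print(dummies["Male"].tolist())
```

Because the labels are sorted alphabetically, "Female" maps to 0 and "Male" to 1, which is consistent with the counts 178 and 172 shown above.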

In [18]:
#encode the data
#gender = pd.DataFrame(df['Gender'])
#gender_encoded=pd.get_dummies(data= gender, drop_first=True)
#gender_encoded

###**Column-3: Age**

In [19]:
df['Age'].value_counts()

Age
35    32
37    17
26    16
40    14
41    14
27    13
39    12
28    12
31    11
38    11
47    11
30    11
48    11
36    10
42    10
29    10
33     9
24     9
32     9
46     8
49     7
20     7
19     7
45     6
23     6
52     6
34     6
25     6
22     5
59     5
18     5
21     4
58     4
53     4
57     4
55     3
60     3
50     3
56     2
51     2
43     2
54     2
44     1
Name: count, dtype: int64

###**Column-4: EstimatedSalary**

In [20]:
df['EstimatedSalary'].value_counts()

EstimatedSalary
72000     11
80000     10
79000      9
75000      8
59000      7
          ..
69000      1
37000      1
100000     1
45000      1
105000     1
Name: count, Length: 112, dtype: int64

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Gender           350 non-null    int32
 1   Age              350 non-null    int64
 2   EstimatedSalary  350 non-null    int64
 3   Purchased        350 non-null    int64
dtypes: int32(1), int64(3)
memory usage: 9.7 KB


In [22]:
df['Purchased'].value_counts()

Purchased
0    243
1    107
Name: count, dtype: int64

In [23]:
df.describe()

Unnamed: 0,Gender,Age,EstimatedSalary,Purchased
count,350.0,350.0,350.0,350.0
mean,0.491429,36.328571,71237.142857,0.305714
std,0.500642,10.178805,34395.196787,0.461369
min,0.0,18.0,15000.0,0.0
25%,0.0,28.0,47000.0,0.0
50%,0.0,35.0,71000.0,0.0
75%,1.0,42.0,88000.0,1.0
max,1.0,60.0,150000.0,1.0


**Observations:**
* a) Here, all the Integer and float Column values are described.
* b) Each column has got a Standard Deviation, Min and Max Values.
* c) We can assume that there is no wrong data and wrong data format.
* **d) But we need to do Scaling, because "Age" (18 to 60) and "EstimatedSalary" (15000 to 150000) are on very different ranges**

In [24]:
df.corr()

Unnamed: 0,Gender,Age,EstimatedSalary,Purchased
Gender,1.0,-0.080695,-0.088822,-0.056851
Age,-0.080695,1.0,0.259816,0.583808
EstimatedSalary,-0.088822,0.259816,1.0,0.481313
Purchased,-0.056851,0.583808,0.481313,1.0
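
The "Purchased" column of the matrix can be ranked by absolute value to see which feature is most linearly related to the target. This sketch copies the three coefficients from the output above into a Series; the variable names are illustrative.

```python
import pandas as pd

# Correlations with "Purchased", copied from the df.corr() output above
corr_with_target = pd.Series({
    "Gender": -0.056851,
    "Age": 0.583808,
    "EstimatedSalary": 0.481313,
})

# Rank by strength of the linear relationship, ignoring the sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```

Age ranks first and Gender last. Correlation only captures linear association, so a low value alone does not prove a feature is useless.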


##**VII) Check the Test accuracy using appropriate algorithm and Holdout Method.**

##**Step-5: Slice X and y Values**

In [25]:
X = df.drop(['Purchased'], axis = 1)
Y = df['Purchased']
X.head()

Unnamed: 0,Gender,Age,EstimatedSalary
0,1,19,19000
1,1,35,20000
2,0,26,43000
3,0,27,57000
4,1,19,76000


In [26]:
Y.head()

0    0
1    0
2    0
3    0
4    0
Name: Purchased, dtype: int64

##**Step-6: Execute Train-Test-Split Command and Verify**

In [27]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state = 66)

In [28]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(280, 3)
(280,)
(70, 3)
(70,)
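
Since the classes are imbalanced, a stratified split is one option for keeping the class ratio similar in the train and test sets. The sketch below is not the split used in this notebook; it uses placeholder features and the same class counts, and passes `stratify` to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Same class counts as the dataset; the feature column is a placeholder
y_all = np.array([0] * 243 + [1] * 107)
X_all = np.arange(len(y_all)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=66, stratify=y_all
)

# Share of class 1 in each part should stay close to 107/350
train_share = y_tr.mean()
test_share = y_te.mean()
print(round(train_share, 3), round(test_share, 3))
```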


##**Step-7: Learn the Data and Predict the dependent Variable values for the "X_test" data using "LogisticRegression()" algorithm**

In [29]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)

In [30]:
y_pred = logmodel.predict(X_test)
y_pred

array([1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0], dtype=int64)

##**Step-8: Calculate the Accuracy of the Model**

In [31]:
accuracy_lr = logmodel.score(X_test, y_test)
print("Accuracy of Logistic Regression on test set:",accuracy_lr)

Accuracy of Logistic Regression on test set: 0.6571428571428571


##**Step-9: Display the Confusion Matrix and Classification Report of the Model**

In [32]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test,y_pred))
print(classification_report(y_test,y_pred))

[[35  4]
 [20 11]]
              precision    recall  f1-score   support

           0       0.64      0.90      0.74        39
           1       0.73      0.35      0.48        31

    accuracy                           0.66        70
   macro avg       0.68      0.63      0.61        70
weighted avg       0.68      0.66      0.63        70
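
The figures in the report follow directly from the confusion matrix, whose rows are the actual classes and whose columns are the predicted classes. As a worked check, the sketch below recomputes accuracy and the class 1 metrics from the four counts printed above.

```python
# Counts from the confusion matrix above: rows = actual, columns = predicted
tn, fp = 35, 4
fn, tp = 20, 11

accuracy = (tp + tn) / (tn + fp + fn + tp)   # 46 / 70
precision_1 = tp / (tp + fp)                 # 11 / 15
recall_1 = tp / (tp + fn)                    # 11 / 31
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 2), round(precision_1, 2),
      round(recall_1, 2), round(f1_1, 2))
```

The low recall for class 1 means the unscaled model misses 20 of the 31 actual purchasers in the test set.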



##**VIII) Implement the Scaling as required**

###**Use Normalization**

In [33]:
columnNames = ['Gender', 'Age', 'EstimatedSalary']

In [34]:
min_max_scaler_object = preprocessing.MinMaxScaler()
X_train1 = min_max_scaler_object.fit_transform(X_train)
X_train1 = pd.DataFrame(X_train1 , columns = columnNames)
X_train1.head()

Unnamed: 0,Gender,Age,EstimatedSalary
0,1.0,0.333333,0.022222
1,1.0,0.071429,0.540741
2,0.0,0.285714,0.348148
3,1.0,0.238095,0.474074
4,0.0,0.333333,1.0


In [35]:
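# Note: this fits a second scaler on the test data; reusing the scaler fitted on X_train (transform only) would avoid test-set leakage
min_max_scaler_object = preprocessing.MinMaxScaler()
X_test1 = min_max_scaler_object.fit_transform(X_test)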
X_test1 = pd.DataFrame(X_test1 , columns = columnNames)
X_test1.head()

Unnamed: 0,Gender,Age,EstimatedSalary
0,0.0,0.809524,0.603306
1,1.0,0.261905,0.214876
2,1.0,1.0,0.702479
3,1.0,0.404762,0.46281
4,1.0,0.238095,0.876033
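
A common alternative is to fit the scaler on the training data only and apply the same fitted scaler to the test data, so both sets share one min/max reference. The sketch below uses hypothetical toy arrays, not the notebook's split, and it does not reproduce the outputs shown here.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy arrays with columns [Age, EstimatedSalary]
X_train_toy = np.array([[18.0, 15000.0], [60.0, 150000.0], [39.0, 82500.0]])
X_test_toy = np.array([[19.0, 19000.0], [35.0, 20000.0]])

scaler = preprocessing.MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_toy)  # learns min/max from train only
X_test_scaled = scaler.transform(X_test_toy)        # reuses the train min/max

# Min-max scaling: (x - min) / (max - min), e.g. (19 - 18) / (60 - 18)
print(X_test_scaled.round(5))
```

With this pattern, a test value outside the training range can fall below 0 or above 1, which is expected.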


In [36]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model
logmodel1 = LogisticRegression()
logmodel1.fit(X_train1, y_train)

In [37]:
#predictions
predictions1 = logmodel1.predict(X_test1)

In [38]:
print(confusion_matrix(y_test, predictions1))
print(classification_report(y_test,predictions1))

[[38  1]
 [16 15]]
              precision    recall  f1-score   support

           0       0.70      0.97      0.82        39
           1       0.94      0.48      0.64        31

    accuracy                           0.76        70
   macro avg       0.82      0.73      0.73        70
weighted avg       0.81      0.76      0.74        70



###**Use Standardization**

In [39]:
from sklearn.preprocessing import StandardScaler
std_scaler_object = preprocessing.StandardScaler()
X_train2 = std_scaler_object.fit_transform(X_train)
X_train2 = pd.DataFrame(X_train2 , columns = columnNames)
X_train2.head()

Unnamed: 0,Gender,Age,EstimatedSalary
0,1.028992,-0.372817,-1.52632
1,1.028992,-1.482264,0.485507
2,-0.971825,-0.574535,-0.261743
3,1.028992,-0.776253,0.226844
4,-0.971825,-0.372817,2.267412


In [40]:
from sklearn.preprocessing import StandardScaler
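# Note: this fits a second scaler on the test data; reusing the scaler fitted on X_train (transform only) would avoid test-set leakage
std_scaler_object = preprocessing.StandardScaler()
X_test2 = std_scaler_object.fit_transform(X_test)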
X_test2 = pd.DataFrame(X_test2 , columns = columnNames)
X_test2.head()

Unnamed: 0,Gender,Age,EstimatedSalary
0,-1.028992,1.223712,0.561557
1,0.971825,-0.917784,-0.88521
2,0.971825,1.968579,0.930944
3,0.971825,-0.359133,0.038258
4,0.971825,-1.010892,1.577371
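
Standardization subtracts the column mean and divides by the column standard deviation. The sketch below checks this by hand against `StandardScaler` on the five "Age" values from the first rows of the dataset; note that the scaler uses the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age values of the first five rows shown in df.head()
ages = np.array([[19.0], [35.0], [26.0], [27.0], [19.0]])

# z = (x - mean) / std, with NumPy's default population std (ddof=0)
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)
scaled = StandardScaler().fit_transform(ages)

print(np.allclose(manual, scaled))
```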


In [41]:
from sklearn.linear_model import LogisticRegression
#create an instance and fit the model
logmodel2 = LogisticRegression()
logmodel2.fit(X_train2, y_train)

In [42]:
#predictions
predictions2 = logmodel2.predict(X_test2)

In [43]:
print(confusion_matrix(y_test, predictions2))
print(classification_report(y_test,predictions2))

[[38  1]
 [17 14]]
              precision    recall  f1-score   support

           0       0.69      0.97      0.81        39
           1       0.93      0.45      0.61        31

    accuracy                           0.74        70
   macro avg       0.81      0.71      0.71        70
weighted avg       0.80      0.74      0.72        70



**Observation: Scaling improves the test accuracy from 66% (no scaling) to 76% with Normalization and 74% with Standardization.**

**Decision: We will use the "Normalization" method for our model.**
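
The test accuracies of the three runs can be recomputed from the confusion matrices printed earlier (correct predictions on the diagonal, divided by the 70 test rows). The dictionary keys below are illustrative labels.

```python
# Confusion matrices copied from the three runs above
matrices = {
    "no scaling": [[35, 4], [20, 11]],
    "normalization": [[38, 1], [16, 15]],
    "standardization": [[38, 1], [17, 14]],
}

accuracies = {}
for name, (row0, row1) in matrices.items():
    correct = row0[0] + row1[1]          # diagonal entries
    total = sum(row0) + sum(row1)        # all test rows
    accuracies[name] = correct / total

for name, value in accuracies.items():
    print(f"{name}: {value:.3f}")
```

The gap between the two scaling methods is a single test row, so the preference for Normalization rests on a small margin.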


##**IX) Write out the transformed Input file for further usage**

In [44]:
X1 = df.drop(['Purchased'], axis = 1)
Y1 = df['Purchased']
X1.head()

Unnamed: 0,Gender,Age,EstimatedSalary
0,1,19,19000
1,1,35,20000
2,0,26,43000
3,0,27,57000
4,1,19,76000


In [45]:
min_max_scaler_object = preprocessing.MinMaxScaler()
X2 = min_max_scaler_object.fit_transform(X1)
X2 = pd.DataFrame(X2 , columns = columnNames)
print(X2.shape)
X2.head()

(350, 3)


Unnamed: 0,Gender,Age,EstimatedSalary
0,1.0,0.02381,0.02963
1,1.0,0.404762,0.037037
2,0.0,0.190476,0.207407
3,0.0,0.214286,0.311111
4,1.0,0.02381,0.451852


In [46]:
df1 = pd.DataFrame(data=X2)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Gender           350 non-null    float64
 1   Age              350 non-null    float64
 2   EstimatedSalary  350 non-null    float64
dtypes: float64(3)
memory usage: 8.3 KB


In [47]:
df1.head()

Unnamed: 0,Gender,Age,EstimatedSalary
0,1.0,0.02381,0.02963
1,1.0,0.404762,0.037037
2,0.0,0.190476,0.207407
3,0.0,0.214286,0.311111
4,1.0,0.02381,0.451852


In [48]:
df1 = pd.concat([df1,Y1], axis=1)
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 350 entries, 0 to 349
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Gender           350 non-null    float64
 1   Age              350 non-null    float64
 2   EstimatedSalary  350 non-null    float64
 3   Purchased        350 non-null    int64  
dtypes: float64(3), int64(1)
memory usage: 11.1 KB


In [49]:
df1.head()

Unnamed: 0,Gender,Age,EstimatedSalary,Purchased
0,1.0,0.02381,0.02963,0
1,1.0,0.404762,0.037037,0
2,0.0,0.190476,0.207407,0
3,0.0,0.214286,0.311111,0
4,1.0,0.02381,0.451852,0


In [50]:
#from google.colab import files
#df1.to_csv("gdrive/My Drive/NCJ-MLP-Training-2022/NCJ-MLP-Projects-Latest/06-Bank-Loan-Project/Data-Files/Loan_Status_Train_Preprocessed1.csv", index = False)
df1.to_csv('D:/Muthu-Office/Indi-ML-Projects-2023/03-Ananconda-Projects/11_Customers_Purchase_Prediction/Data-Files/Train_Social_Network_Ads_Data_Preprocessed.csv', index = False)
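
If new data is to be scored later, it generally needs the same scaling as the training data, so it can help to keep the fitted scaler alongside the CSV. The following is a minimal sketch using the standard `pickle` module on hypothetical toy arrays; it keeps the bytes in memory, whereas in practice they would be written to a file of your choosing.

```python
import pickle

import numpy as np
from sklearn import preprocessing

# Hypothetical toy arrays with columns [Age, EstimatedSalary]
train_features = np.array([[18.0, 15000.0], [60.0, 150000.0]])
new_features = np.array([[35.0, 20000.0]])

scaler = preprocessing.MinMaxScaler().fit(train_features)

# Serialize and restore the fitted scaler (in memory for this sketch)
saved_bytes = pickle.dumps(scaler)
restored = pickle.loads(saved_bytes)

# The restored scaler applies the same min/max learned from training
new_scaled = restored.transform(new_features)
print(new_scaled.round(5))
```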