##**Classification Project using Marketplace Features Data**

**1) Install/Import the required Python Packages/Libraries**

In [None]:
#Import required python packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
from sklearn import preprocessing
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

**2) Mounting the Google Drive**

In [None]:
# Mount the Google Drive
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


**3) Read the Data file and check**

In [None]:
# Read the marketplace features data from the .csv file and check its shape (number of rows and columns)
df = pd.read_csv('gdrive/My Drive/SRM-Internship-2021-Latest/Marketplace-Features-Creation-Project/04-Output-cum-Input-Files2/Mkt_Features_Classification.csv')
print(df.shape)
df.head()

(34050, 10)


| | UserID | No_of_days_Visited_7_Days | No_Of_Products_Viewed_15_Days | User_Vintage | Most_Viewed_product_15_Days | Most_Active_OS | Recently_Viewed_Product | Pageloads_last_7_days | Clicks_last_7_days | Target_Customers |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | U106593 | 2 | 9 | 259 | PR100017 | ANDROID | PR103384 | 4 | 5 | 0 |
| 1 | U108297 | 2 | 6 | 1097 | PR101030 | WINDOWS | PR100108 | 9 | 4 | 0 |
| 2 | U132443 | 0 | 0 | 240 | Product101 | WINDOWS | PR100070 | 0 | 0 | 0 |
| 3 | U134616 | 0 | 0 | 447 | Product101 | WINDOWS | PR100495 | 0 | 0 | 0 |
| 4 | U130784 | 0 | 0 | 262 | Product101 | CHROME OS | PR102323 | 0 | 0 | 0 |


##**I) Determine the Machine Learning type and sub-type as applicable**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34050 entries, 0 to 34049
Data columns (total 10 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   UserID                         34050 non-null  object
 1   No_of_days_Visited_7_Days      34050 non-null  int64 
 2   No_Of_Products_Viewed_15_Days  34050 non-null  int64 
 3   User_Vintage                   34050 non-null  int64 
 4   Most_Viewed_product_15_Days    34050 non-null  object
 5   Most_Active_OS                 34050 non-null  object
 6   Recently_Viewed_Product        34050 non-null  object
 7   Pageloads_last_7_days          34050 non-null  int64 
 8   Clicks_last_7_days             34050 non-null  int64 
 9   Target_Customers               34050 non-null  int64 
dtypes: int64(6), object(4)
memory usage: 2.6+ MB


In [None]:
df.isnull().sum()

UserID                           0
No_of_days_Visited_7_Days        0
No_Of_Products_Viewed_15_Days    0
User_Vintage                     0
Most_Viewed_product_15_Days      0
Most_Active_OS                   0
Recently_Viewed_Product          0
Pageloads_last_7_days            0
Clicks_last_7_days               0
Target_Customers                 0
dtype: int64

**Observations on the given Dataset:**
* a) Number of Independent Variables: 9 (Identified), including the identifier column "UserID"
* b) Number of Dependent Variables: 1 (Target_Customers) (Identified)
* c) There are no missing values in the Dependent Variable column "Target_Customers"


**Conclusions:**
###**a) Since a labelled Dependent variable is available, the given dataset belongs to the "Supervised Learning" main-type.**
###**b) Since the Dependent variable takes only discrete class labels (0 and 1), the given dataset is of "Classification" sub-type.**
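
The reasoning behind these conclusions can be checked with a few lines of code. The following is a minimal sketch, not part of the original notebook: `demo` is a hypothetical stand-in for `df`, and the cut-off of 20 distinct labels is an arbitrary assumption used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's df; only the target column matters here.
demo = pd.DataFrame({"Target_Customers": [0, 0, 1, 0, 1, 0]})

target = demo["Target_Customers"]
n_classes = int(target.nunique())            # number of distinct label values
labels = sorted(target.unique().tolist())    # the label values themselves

# A small, fixed set of labels points to classification; exactly two means binary.
if n_classes == 2:
    problem_type = "binary classification"
elif n_classes <= 20:  # arbitrary illustrative threshold (assumption)
    problem_type = "multi-class classification"
else:
    problem_type = "regression or high-cardinality target"

print(labels, problem_type)
```

Applied to the real data, the same check would be run on `df["Target_Customers"]`.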

##**II) Check and remove the duplicate records, if any**

In [None]:
df.shape

(34050, 10)

In [None]:
# Returns True for every row that is a duplicate, otherwise False:
print(df.duplicated())

0        False
1        False
2        False
3        False
4        False
         ...  
34045    False
34046    False
34047    False
34048    False
34049    False
Length: 34050, dtype: bool


In [None]:
# Remove all duplicates:
df.drop_duplicates(inplace = True)

In [None]:
df.shape

(34050, 10)

###**Conclusion: No Duplicate Records**
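
Comparing the shape before and after `drop_duplicates` works, but the duplicate count can also be read directly. Below is a small sketch on a hypothetical toy frame (`demo` and its values are made up for illustration); it uses only `DataFrame.duplicated` and `DataFrame.drop_duplicates`.

```python
import pandas as pd

# Hypothetical toy data with one fully repeated row (the second "U2" row).
demo = pd.DataFrame({
    "UserID": ["U1", "U2", "U2", "U3"],
    "Clicks_last_7_days": [5, 0, 0, 2],
})

# duplicated() marks rows identical to an earlier row; summing counts them.
n_dupes = int(demo.duplicated().sum())

# The same check restricted to the identifier column only.
n_repeated_ids = int(demo.duplicated(subset=["UserID"]).sum())

deduped = demo.drop_duplicates()
print(n_dupes, n_repeated_ids, demo.shape, deduped.shape)
```

On the notebook's data, `df.duplicated().sum()` returning 0 would confirm the conclusion above in a single number.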

##**III) Check the Class balance**

In [None]:
df["Target_Customers"].value_counts()

0    33725
1      325
Name: Target_Customers, dtype: int64

###**Conclusion: It is a Binary Classification with heavily imbalanced Classes (325 positives out of 34,050 records, about 0.95%)**
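
To put a number on the imbalance, the class shares can be computed from the counts printed above. This is a minimal sketch that re-enters those two counts by hand; on the real data, `df["Target_Customers"].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({0: 33725, 1: 325}, name="Target_Customers")

share = counts / counts.sum()                        # fraction of rows per class
minority_pct = round(float(share.loc[1]) * 100, 2)   # positives as a percentage
imbalance_ratio = round(float(counts.loc[0] / counts.loc[1]), 1)  # negatives per positive

print(minority_pct, imbalance_ratio)
```

With a positive class this rare, plain accuracy is a weak yardstick: a model that always predicts 0 would already be right for roughly 99% of the rows.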

##**IV) Check for Missing Values and handle them as required**

**a) Check the Missing Values, if any**

In [None]:
df.isnull().sum()

UserID                           0
No_of_days_Visited_7_Days        0
No_Of_Products_Viewed_15_Days    0
User_Vintage                     0
Most_Viewed_product_15_Days      0
Most_Active_OS                   0
Recently_Viewed_Product          0
Pageloads_last_7_days            0
Clicks_last_7_days               0
Target_Customers                 0
dtype: int64

**Observations:**
* a) There are no missing values in this file.


**Decision and Actions:**

###**No action is required.**

##**VI) Check the unique Values of each column and observe the following and take actions as required:**
* **a) Wrong Data in the columns, if any** 
* **b) Wrong format of the data in the columns, if any**
* **c) Identify the categorical columns that need to be converted to numeric values using nominal or ordinal encoding**
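
The three checks listed above could be carried out along the following lines. This is a sketch only: `demo` is a hypothetical miniature of `df` built from values visible in `df.head()`, and the assumption that product codes should look like `PR` followed by six digits is inferred from those few rows, not confirmed by the source.

```python
import pandas as pd

# Hypothetical miniature of the notebook's df, using values seen in df.head().
demo = pd.DataFrame({
    "Most_Viewed_product_15_Days": ["PR100017", "PR101030", "Product101", "Product101"],
    "Most_Active_OS": ["ANDROID", "WINDOWS", "WINDOWS", "CHROME OS"],
    "Clicks_last_7_days": [5, 4, 0, 0],
})

# a) List the distinct values of every column to spot wrong entries.
unique_counts = {col: int(demo[col].nunique()) for col in demo.columns}
for col in demo.columns:
    print(col, unique_counts[col], demo[col].unique()[:10])

# b) Flag product codes that do not follow the assumed "PR" + six digits format.
codes = demo["Most_Viewed_product_15_Days"]
off_pattern = sorted(codes[~codes.str.match(r"^PR\d{6}$")].unique())

# c) Non-numeric columns are the candidates for nominal or ordinal encoding.
categorical_cols = demo.select_dtypes(exclude="number").columns.tolist()

print(off_pattern, categorical_cols)
```

Whether a value such as `Product101` is a data error or a deliberate placeholder cannot be decided from the code alone; it has to be checked against how the features were created.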


##**Step-5: Slice X and y Values**

In [None]:
X = df.drop(['Target_Customers'], axis = 1)
Y = df['Target_Customers']
X.head()

| | UserID | No_of_days_Visited_7_Days | No_Of_Products_Viewed_15_Days | User_Vintage | Most_Viewed_product_15_Days | Most_Active_OS | Recently_Viewed_Product | Pageloads_last_7_days | Clicks_last_7_days |
|---|---|---|---|---|---|---|---|---|---|
| 0 | U106593 | 2 | 9 | 259 | PR100017 | ANDROID | PR103384 | 4 | 5 |
| 1 | U108297 | 2 | 6 | 1097 | PR101030 | WINDOWS | PR100108 | 9 | 4 |
| 2 | U132443 | 0 | 0 | 240 | Product101 | WINDOWS | PR100070 | 0 | 0 |
| 3 | U134616 | 0 | 0 | 447 | Product101 | WINDOWS | PR100495 | 0 | 0 |
| 4 | U130784 | 0 | 0 | 262 | Product101 | CHROME OS | PR102323 | 0 | 0 |


In [None]:
Y.head()

0    0
1    0
2    0
3    0
4    0
Name: Target_Customers, dtype: int64

##**Step-6: Execute Train-Test-Split Command and Verify**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.3, random_state = 66)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(23835, 9)
(23835,)
(10215, 9)
(10215,)


In [None]:
y_test.value_counts()

0    10123
1       92
Name: Target_Customers, dtype: int64
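
The split above is not stratified, so the share of positives in the test set depends on the random seed: here 92 of the 325 positives (about 28%) landed in the 30% test set. With classes this imbalanced, passing `stratify` to `train_test_split` keeps the class ratio the same in both parts. The sketch below shows this on synthetic data; `demo_X`, `demo_y`, and the `Xs_`/`ys_` names are hypothetical and chosen so as not to clash with the notebook's variables.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 1000 rows, 1% positives, one dummy feature.
demo_X = pd.DataFrame({"feature": range(1000)})
demo_y = pd.Series([1] * 10 + [0] * 990, name="Target_Customers")

# stratify=demo_y makes each part keep the overall class proportions.
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    demo_X, demo_y, test_size=0.3, random_state=66, stratify=demo_y
)

print(ys_train.value_counts())
print(ys_test.value_counts())
```

Switching the notebook to a stratified split would change the train and test contents, so all downstream outputs would need to be regenerated.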

In [None]:
Train1=X_train.copy()
print(Train1.shape)
Train1.head()

(23835, 9)


| | UserID | No_of_days_Visited_7_Days | No_Of_Products_Viewed_15_Days | User_Vintage | Most_Viewed_product_15_Days | Most_Active_OS | Recently_Viewed_Product | Pageloads_last_7_days | Clicks_last_7_days |
|---|---|---|---|---|---|---|---|---|---|
| 2586 | U106331 | 1 | 2 | 776 | PR100166 | WINDOWS | PR100166 | 1 | 0 |
| 26266 | U129596 | 0 | 0 | 82 | Product101 | ANDROID | PR100319 | 0 | 0 |
| 27854 | U110157 | 1 | 1 | 300 | PR100371 | WINDOWS | PR100371 | 1 | 2 |
| 18514 | U125417 | 0 | 1 | 404 | PR100401 | WINDOWS | PR100401 | 0 | 0 |
| 11219 | U108112 | 4 | 7 | 80 | PR100166 | WINDOWS | PR100095 | 14 | 20 |


In [None]:
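# Re-attach the target to the training features; the assignment aligns on the row index,
# so it must run before the index is reset
Train1['Target_Customers']=y_train
print(Train1.shape)
Train1.head()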

(23835, 10)


| | UserID | No_of_days_Visited_7_Days | No_Of_Products_Viewed_15_Days | User_Vintage | Most_Viewed_product_15_Days | Most_Active_OS | Recently_Viewed_Product | Pageloads_last_7_days | Clicks_last_7_days | Target_Customers |
|---|---|---|---|---|---|---|---|---|---|---|
| 2586 | U106331 | 1 | 2 | 776 | PR100166 | WINDOWS | PR100166 | 1 | 0 | 0 |
| 26266 | U129596 | 0 | 0 | 82 | Product101 | ANDROID | PR100319 | 0 | 0 | 0 |
| 27854 | U110157 | 1 | 1 | 300 | PR100371 | WINDOWS | PR100371 | 1 | 2 | 0 |
| 18514 | U125417 | 0 | 1 | 404 | PR100401 | WINDOWS | PR100401 | 0 | 0 | 0 |
| 11219 | U108112 | 4 | 7 | 80 | PR100166 | WINDOWS | PR100095 | 14 | 20 | 0 |


In [None]:
Train1.reset_index(drop=True, inplace=True)
Train1.head()

| | UserID | No_of_days_Visited_7_Days | No_Of_Products_Viewed_15_Days | User_Vintage | Most_Viewed_product_15_Days | Most_Active_OS | Recently_Viewed_Product | Pageloads_last_7_days | Clicks_last_7_days | Target_Customers |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | U106331 | 1 | 2 | 776 | PR100166 | WINDOWS | PR100166 | 1 | 0 | 0 |
| 1 | U129596 | 0 | 0 | 82 | Product101 | ANDROID | PR100319 | 0 | 0 | 0 |
| 2 | U110157 | 1 | 1 | 300 | PR100371 | WINDOWS | PR100371 | 1 | 2 | 0 |
| 3 | U125417 | 0 | 1 | 404 | PR100401 | WINDOWS | PR100401 | 0 | 0 | 0 |
| 4 | U108112 | 4 | 7 | 80 | PR100166 | WINDOWS | PR100095 | 14 | 20 | 0 |


In [None]:
# Save the training data (features plus target) to Google Drive, without the row index
Train1.to_csv("gdrive/My Drive/SRM-Internship-2021-Latest/Marketplace-Features-Creation-Project/05-Output-cum-Input-Files3/Mkt_Features_Training_Data.csv", index = False)

In [None]:
Customer1=X_test.copy()
print(Customer1.shape)
Customer1.head()

(10215, 9)


| | UserID | No_of_days_Visited_7_Days | No_Of_Products_Viewed_15_Days | User_Vintage | Most_Viewed_product_15_Days | Most_Active_OS | Recently_Viewed_Product | Pageloads_last_7_days | Clicks_last_7_days |
|---|---|---|---|---|---|---|---|---|---|
| 19335 | U124954 | 0 | 1 | 169 | PR100047 | WINDOWS | PR100047 | 0 | 0 |
| 28783 | U124892 | 0 | 1 | 85 | PR100279 | WINDOWS | PR100279 | 0 | 0 |
| 5908 | U123462 | 0 | 1 | 1057 | PR100980 | WINDOWS | Product101 | 0 | 0 |
| 26914 | U107624 | 1 | 1 | 2163 | PR100498 | ANDROID | PR100498 | 1 | 1 |
| 16600 | U122026 | 0 | 0 | 98 | Product101 | ANDROID | PR100023 | 0 | 0 |


In [None]:
#Customer1['Target_Customers']=y_test
#print(Customer1.shape)
#Customer1.head()

In [None]:
Customer1.reset_index(drop=True, inplace=True)
Customer1.head()

| | UserID | No_of_days_Visited_7_Days | No_Of_Products_Viewed_15_Days | User_Vintage | Most_Viewed_product_15_Days | Most_Active_OS | Recently_Viewed_Product | Pageloads_last_7_days | Clicks_last_7_days |
|---|---|---|---|---|---|---|---|---|---|
| 0 | U124954 | 0 | 1 | 169 | PR100047 | WINDOWS | PR100047 | 0 | 0 |
| 1 | U124892 | 0 | 1 | 85 | PR100279 | WINDOWS | PR100279 | 0 | 0 |
| 2 | U123462 | 0 | 1 | 1057 | PR100980 | WINDOWS | Product101 | 0 | 0 |
| 3 | U107624 | 1 | 1 | 2163 | PR100498 | ANDROID | PR100498 | 1 | 1 |
| 4 | U122026 | 0 | 0 | 98 | Product101 | ANDROID | PR100023 | 0 | 0 |


In [None]:
# Save the customer data (test features only, without the target) to Google Drive, without the row index
Customer1.to_csv("gdrive/My Drive/SRM-Internship-2021-Latest/Marketplace-Features-Creation-Project/05-Output-cum-Input-Files3/Mkt_Features_Customer_Data.csv", index = False)
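
A quick way to confirm that a saved file matches the frame in memory is to read it back and compare. The sketch below does this in a temporary directory rather than on Google Drive; `demo` and the file name `demo_customer_data.csv` are hypothetical. It also illustrates why `index = False` is used: the shuffled row index left by the split is not written, so the file gains no extra `Unnamed: 0` column when it is read again.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame with a shuffled index, as left by train_test_split.
demo = pd.DataFrame(
    {"UserID": ["U124954", "U124892"], "Clicks_last_7_days": [0, 0]},
    index=[19335, 28783],
)
demo = demo.reset_index(drop=True)  # same step the notebook applies before saving

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_customer_data.csv")  # hypothetical file name
    demo.to_csv(path, index=False)   # do not write the row index
    restored = pd.read_csv(path)     # read the file back while it still exists

same_columns = list(restored.columns) == list(demo.columns)
same_shape = restored.shape == demo.shape
print(same_columns, same_shape)
```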