# <font color='blue'>Customer's Conversion Prediction: </font> 
I got the best results with XGBoost: about 96% accuracy on the training set, and first place on Kaggle.
---
 
* **Part 1 - Data Preprocessing**
   1. Importing libraries
   2. Importing the dataset
   3. Dataset information (Pandas Profiling)
   4. Dropping unnecessary columns
      - "Train" set
      - "Test" set
   5. Taking care of missing data for the "Train" and "Test" sets
      - 'Administrative'          
      - 'Administrative_Duration'
      - 'Informational'
      - 'Informational_Duration' 
      - 'ProductRelated' 
      - 'ProductRelated_Duration'
      - 'BounceRates'
      - 'ExitRates'
   6. Taking care of some outliers
   7. Encoding categorical data
      - 'Month'        
      - 'VisitorType'
      - 'Weekend'
      - 'OperatingSystems'
      - 'Browser'
      - 'Region'
      - 'TrafficType'
   8. Splitting the Train & Test datasets
   9. Feature Scaling   
* **Part 2 - Training the Classification model**
   1. XGBoost
   2. Other algorithms
   3. Accuracy score  
* **Part 3 - Creating a submission.csv**

# **Data Preprocessing**

## **Importing libraries**

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
sns.set()
%matplotlib inline

## **Importing the dataset**


In [3]:
!unzip dataset-customer.zip

Archive:  dataset-customer.zip
  inflating: Test (1).csv            
  inflating: Train (1).csv           
  inflating: sample.csv              


In [5]:
# Rename the extracted files to Train.csv and Test.csv before running this cell
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')

In [None]:
train

In [None]:
test

## **Dataset information**

In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8731 entries, 0 to 8730
Data columns (total 19 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   id                       8731 non-null   int64  
 1   Administrative           8717 non-null   float64
 2   Administrative_Duration  8717 non-null   float64
 3   Informational            8717 non-null   float64
 4   Informational_Duration   8717 non-null   float64
 5   ProductRelated           8717 non-null   float64
 6   ProductRelated_Duration  8717 non-null   float64
 7   BounceRates              8717 non-null   float64
 8   ExitRates                8717 non-null   float64
 9   PageValues               8731 non-null   float64
 10  SpecialDay               8731 non-null   float64
 11  Month                    8731 non-null   object 
 12  OperatingSystems         8731 non-null   int64  
 13  Browser                  8731 non-null   int64  
 14  Region                  

In [None]:
train.isna()

In [None]:
train.isna().sum()

In [None]:
train.isna().sum().sum()

In [None]:
train.describe()

In [None]:
train.corr()
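
The full correlation matrix is easier to read when it is restricted to the target column. The following is a minimal sketch on a made-up frame: the `toy` data and the `corr_with_target` name are invented for illustration, and only the column names come from the dataset. It assumes the target is boolean, as `Revenue` is here.

```python
import pandas as pd

# Toy data, invented for illustration; not the competition dataset.
toy = pd.DataFrame({
    "PageValues": [0.0, 10.0, 0.0, 25.0, 5.0, 0.0],
    "ExitRates": [0.20, 0.02, 0.15, 0.01, 0.05, 0.18],
    "Month": ["Feb", "Mar", "May", "Nov", "Nov", "Dec"],
    "Revenue": [False, True, False, True, True, False],
})

# Keep numeric and boolean columns only (text columns such as Month cannot be correlated).
numeric = toy.select_dtypes(include=["number", "bool"]).astype(float)

# Rank the features by the absolute value of their correlation with the target.
corr_with_target = (
    numeric.corr()["Revenue"]
    .drop("Revenue")
    .sort_values(key=abs, ascending=False)
)
print(corr_with_target)
```

Features near the bottom of this ranking are the natural candidates to try dropping.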

## **Dropping unnecessary columns**

In [None]:
# Browser, OperatingSystems, Weekend and Region are the features with the lowest correlation with Revenue
# OperatingSystems and Region show only a very small correlation with Revenue
# I tried dropping Browser, but got better results when keeping it
# Dropping Weekend gave better results

In [7]:
train = train.drop(columns=['id', 'OperatingSystems', 'Weekend', 'Region'])

In [8]:
test = test.drop(columns=['OperatingSystems', 'Region','Weekend'])

In [10]:
# Drop duplicated rows and rebuild a clean 0..n-1 index
train = train.drop_duplicates().reset_index(drop=True)
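
As a quick check of how de-duplication behaves, here is a small sketch on an invented frame (`toy` and `deduped` are hypothetical names). It relies on the fact that `drop_duplicates(inplace=True)` returns `None`, so a chained call only works when the result is reassigned.

```python
import pandas as pd

# Toy data, invented for illustration: the first two rows are identical.
toy = pd.DataFrame({"a": [1, 1, 2, 2], "b": ["x", "x", "y", "z"]})

# Reassign the result instead of using inplace=True, then reset the index
# so that it runs 0..n-1 without gaps.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(deduped)
```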

## **Taking care of missing data**

In [11]:
# Train dataset
# Fill values: mean for the page counts and ExitRates,
# standard deviation for the durations, maximum for BounceRates
std0 = train["Administrative_Duration"].std()
mean0 = train['Administrative'].mean()
std1 = train["Informational_Duration"].std()
mean1 = train['Informational'].mean()
std2 = train["ProductRelated_Duration"].std()
mean2 = train['ProductRelated'].mean()
exit_mean = train['ExitRates'].mean()  # named exit_mean to avoid shadowing the built-in exit
bounce = train['BounceRates'].max()
train = train.fillna({
    'Administrative': mean0,
    'Administrative_Duration': std0,
    'Informational': mean1,
    'Informational_Duration': std1,
    'ProductRelated': mean2,
    'ProductRelated_Duration': std2,
    'BounceRates': bounce,
    'ExitRates': exit_mean
})

In [12]:
# Test dataset (same fill rules, computed on the test set)
std0 = test["Administrative_Duration"].std()
mean0 = test['Administrative'].mean()
std1 = test["Informational_Duration"].std()
mean1 = test['Informational'].mean()
std2 = test["ProductRelated_Duration"].std()
mean2 = test['ProductRelated'].mean()
exit_mean = test['ExitRates'].mean()  # named exit_mean to avoid shadowing the built-in exit
bounce = test['BounceRates'].max()
test = test.fillna({
    'Administrative': mean0,
    'Administrative_Duration': std0,
    'Informational': mean1,
    'Informational_Duration': std1,
    'ProductRelated': mean2,
    'ProductRelated_Duration': std2,
    'BounceRates': bounce,
    'ExitRates': exit_mean
})
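
The two cells above compute the fill values separately for each dataset. A common alternative, shown here only as a sketch and not as the method used in this notebook, is to learn the fill values once from the training data and reuse them for the test data. The frames and the `fill_values` name are invented for illustration, and the column mean is assumed as the fill statistic.

```python
import numpy as np
import pandas as pd

# Toy frames, invented for illustration.
train_toy = pd.DataFrame({
    "Administrative": [0.0, 2.0, np.nan, 4.0],
    "ExitRates": [0.1, np.nan, 0.3, 0.2],
})
test_toy = pd.DataFrame({
    "Administrative": [np.nan, 1.0],
    "ExitRates": [0.05, np.nan],
})

# Learn one fill value per column from the training data only (NaN is skipped by mean()).
fill_values = train_toy.mean().to_dict()

# Apply the same values to both frames.
train_filled = train_toy.fillna(fill_values)
test_filled = test_toy.fillna(fill_values)
print(test_filled)
```

With this variant, the test set is imputed with exactly the statistics the model saw during training.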


## **Taking care of some outliers**

In [None]:
#I plotted it to see the outliers

In [13]:
train = train[ train['Informational'] < 20]

In [14]:
train = train[ train['Administrative'] < 25]
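
The two filters can also be combined into one boolean mask, which makes it easy to report how many rows are removed. A minimal sketch on invented data (`toy`, `mask` and `filtered` are hypothetical names; the thresholds 20 and 25 are the ones used above):

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "Informational": [0, 3, 24, 1],
    "Administrative": [2, 27, 5, 0],
})

# A row is kept only if it passes both thresholds.
mask = (toy["Informational"] < 20) & (toy["Administrative"] < 25)
filtered = toy[mask]

print(f"removed {len(toy) - len(filtered)} of {len(toy)} rows")
```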

## **Encoding categorical data**

In [19]:
# Check unique values for each categorical column
cat_cols = ['Month', 'VisitorType', 'Browser', 'TrafficType']
for col in cat_cols:
  print(col)
  print(train[col].unique(), '\n')

Month
['Feb' 'Mar' 'May' 'Oct' 'June' 'Jul' 'Aug' 'Nov' 'Sep' 'Dec'] 

VisitorType
['Returning_Visitor' 'New_Visitor' 'Other'] 

Browser
[ 1  2  3  4  5  6  7 10  8  9 12 13 11] 

TrafficType
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 18 19 16 17 20] 



In [20]:
train['Revenue'] = train['Revenue'].map({
     False : 0,
      True :1
      })

In [None]:
# I did that before dropping weekend
"""
train['Weekend'] = train['Weekend'].map({ 
    False : 0,  
    True :1
    })
test['Weekend'] = test['Weekend'].map({ 
    False : 0,  
    True :1
    })"""

In [21]:
train = pd.get_dummies(train)

In [22]:
X_test = pd.get_dummies(test)
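
Note that `pd.get_dummies` without arguments only encodes text (object or category) columns, so integer-coded columns such as `Browser` and `TrafficType` stay numeric. If one-hot encoding them is wanted, the `columns` parameter can be used. The following is a sketch on invented data, not a change to the pipeline above:

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({"Month": ["Feb", "Nov"], "Browser": [1, 2]})

# Default behaviour: only the text column Month is encoded; Browser stays one integer column.
default_encoded = pd.get_dummies(toy)

# Explicit column list: Browser is encoded too, as Browser_1 and Browser_2.
explicit_encoded = pd.get_dummies(toy, columns=["Month", "Browser"])

print(list(default_encoded.columns))
print(list(explicit_encoded.columns))
```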

## **Splitting the Train & Test datasets**

In [23]:
X_train = train.drop('Revenue', axis=1)
Y_train = train['Revenue']

In [24]:
# The test dataset contains only November and December, so the dummy columns for the other months must be added
for c1 in ['Month_Aug', 'Month_Feb', 'Month_Jul', 'Month_June', 'Month_Mar', 'Month_May', 'Month_Oct', 'Month_Sep']:
    X_test[c1] = 0

In [25]:
# Put the columns of the test and train datasets in the same (sorted) order
X_test = X_test.reindex(columns=sorted(X_train.columns))
X_train = X_train.reindex(columns=sorted(X_train.columns))
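
Adding the missing dummy columns by hand works, but it depends on knowing which categories are absent. As a hedged alternative, `reindex` with `fill_value=0` can align the test columns to the training columns in one step. A minimal sketch on invented frames (all `*_toy` names are hypothetical):

```python
import pandas as pd

# Toy frames, invented for illustration: the test set lacks the month "Feb".
train_toy = pd.DataFrame({"PageValues": [0.0, 5.0, 2.0],
                          "Month": ["Feb", "Nov", "Dec"]})
test_toy = pd.DataFrame({"PageValues": [1.0, 3.0],
                         "Month": ["Nov", "Dec"]})

X_train_toy = pd.get_dummies(train_toy)
X_test_toy = pd.get_dummies(test_toy)

# Align the test columns to the training columns:
# missing dummy columns are created and filled with 0, extra columns are dropped.
X_test_toy = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)
print(X_test_toy)
```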

## **Feature Scaling**

In [26]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
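
The scaler is fitted on the training data only and then reused on the test data, so both are standardized with the same mean and standard deviation. A small sketch with invented numbers illustrates the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy arrays, invented for illustration.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

sc_toy = StandardScaler()
X_train_scaled = sc_toy.fit_transform(X_train_toy)  # learns mean and std from the training data
X_test_scaled = sc_toy.transform(X_test_toy)        # reuses the training statistics

print(X_train_scaled.mean(axis=0))  # each training column is centred on 0
print(X_test_scaled)                # test values are expressed relative to the training data
```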

# **Training the Classification model**

## **XGBoost**

In [27]:
from xgboost import XGBClassifier
classifier = XGBClassifier(max_depth=6)
classifier.fit(X_train, Y_train)

XGBClassifier(max_depth=6)

In [28]:
y_preds = classifier.predict(X_test)
test['Revenue'] = y_preds

In [29]:
submission = pd.DataFrame({
        "id": test["id"],
        "Revenue": y_preds
    })

## **Other Algorithms**

In [None]:
"""from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier.fit(X_train, Y_train)"""
"""from sklearn.ensemble import GradientBoostingClassifier
classifier = GradientBoostingClassifier()
classifier.fit(X_train, Y_train)"""
"""from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, Y_train)"""
"""from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, Y_train)
Y_pred = classifier.predict(X_test)"""
"""from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 42)
classifier.fit(X_train, Y_train)"""
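
Rather than swapping classifiers in and out by hand, the candidates can be compared in a loop with cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the real features, so the scores it prints say nothing about this dataset; the `candidates` and `scores` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train (an assumption for illustration only).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "knn": KNeighborsClassifier(n_neighbors=5),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0),
    "decision_tree": DecisionTreeClassifier(criterion="entropy", random_state=42),
}

# Mean 5-fold cross-validated accuracy for each candidate.
scores = {}
for name, model in candidates.items():
    scores[name] = cross_val_score(model, X_toy, y_toy, cv=5, scoring="accuracy").mean()

# Print the candidates from best to worst.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```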

## **Accuracy score**

In [30]:
# Accuracy on the training set, in percent
# (an optimistic estimate, since the model has already seen this data)
train_accuracy = round(classifier.score(X_train, Y_train) * 100, 2)
train_accuracy

96.36
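
This score is measured on the same rows the model was trained on. To estimate performance on unseen data, part of the training set can be held out for validation. The sketch below shows the idea on synthetic data, with scikit-learn's `GradientBoostingClassifier` standing in for XGBoost; both the data and the stand-in model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed training data.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% of the rows, keeping the class balance the same in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = accuracy_score(y_tr, model.predict(X_tr))
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

A large gap between the two numbers would suggest overfitting.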

# **Creating a submission.csv**

In [31]:
submission.to_csv('solution.csv', index=False)