
# Data Science Challenge
**The purpose of this notebook is to document, step by step, a rigorous machine learning workflow that classifies a set of calls into their respective tags (classes).**


We'll work through the following steps to reach that outcome:


1.   Exploratory Data Analysis and Feature Engineering.
2.   Choosing Candidate Models.
3.   Training the models and evaluating them.
4.   Comparing the models.



In [None]:
#Importing the Libraries
import pandas as pd
import numpy as np
from scipy.stats import randint
import seaborn as sns  # statistical plotting
import matplotlib.pyplot as plt
from io import StringIO
from sklearn.preprocessing import OrdinalEncoder
from sklearn.datasets import make_classification
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest , mutual_info_classif
from IPython.display import display
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from functools import partial
from sklearn.feature_selection import f_classif





# 1.  Exploratory Data Analysis and Feature Engineering.




In [None]:
#Loading the data
df = pd.read_csv('/content/drive/MyDrive/DS-challenge-dataset.csv')
df.shape

(3786, 17)

In [None]:
df.head(5).T  # Transposed view: columns shown as rows, which is easier to read on a first pass


Unnamed: 0,0,1,2,3,4
id,5288,5285,5282,5277,5279
duration,657.344,205.479,1607.1,249.913,1948.75
date,2021-10-11 12:02:24,2021-10-08 18:04:23,2021-10-08 11:03:51,2021-10-07 18:10:25,2021-10-07 18:00:00
userId,14,63,18,23,8
modifiedById,,,,,
phoneProvider,zoom,aircall,zoom,aircall,google
direction,outbound,outbound,outbound,outbound,outbound
mediaType,video,audio,video,audio,video
dealId,543,,488,,
userTalkRatio,0.292525,0.452397,0.483552,0.452936,0.315928


In [None]:
df.dtypes

id                           int64
duration                   float64
date                        object
userId                       int64
modifiedById               float64
phoneProvider               object
direction                   object
mediaType                   object
dealId                     float64
userTalkRatio              float64
longestContactMonologue    float64
patience                   float64
interactionSpeed           float64
role                        object
teams                       object
contacts                   float64
tag                         object
dtype: object

**Some columns are of type object (i.e. strings), so we'll have to either encode them or extract useful information from them before they can pass through our predictive pipeline.**
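As a minimal sketch of how the non-numeric columns could be listed programmatically rather than read off the `dtypes` output, the snippet below uses `select_dtypes` on a small toy frame. The frame `toy` and its values are illustrative assumptions, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "duration": [657.34, 205.48],
    "phoneProvider": ["zoom", "aircall"],
    "tag": ["Client Follow Up", "Cold Call"],
})

# Every column that is not numeric still needs encoding before modelling
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```

Excluding numeric dtypes, instead of including `object`, keeps the selection working if the strings are stored under a dedicated string dtype.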

In [None]:
# Before anything else, round the long decimals to 2 decimal places
df = df.round(2)  # this is just much cleaner and easier to visualize
df 

Unnamed: 0,id,duration,date,userId,modifiedById,phoneProvider,direction,mediaType,dealId,userTalkRatio,longestContactMonologue,patience,interactionSpeed,role,teams,contacts,tag
0,5288,657.34,2021-10-11 12:02:24,14,,zoom,outbound,video,543.0,0.29,80.30,0.77,3.01,admin,{Sales},1.0,Client Follow Up
1,5285,205.48,2021-10-08 18:04:23,63,,aircall,outbound,audio,,0.45,23.71,0.41,8.18,admin,{Sales},1.0,Cold Call
2,5282,1607.10,2021-10-08 11:03:51,18,,zoom,outbound,video,488.0,0.48,57.84,0.28,4.59,admin,{Account Manager},1.0,Client Follow Up
3,5277,249.91,2021-10-07 18:10:25,23,,aircall,outbound,audio,,0.45,43.19,1.77,5.52,admin,{Sales},1.0,Unscheduled Follow up
4,5279,1948.75,2021-10-07 18:00:00,8,,google,outbound,video,,0.32,267.82,0.47,2.03,admin,{Sales},2.0,1st Call
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3781,6,4448.00,2020-05-18 12:01:18,1,,zoom,outbound,audio,,0.26,198.98,1.65,2.01,admin,{Sales},1.0,Demo
3782,5,3756.00,2020-05-14 14:00:35,3,,zoom,outbound,audio,,0.45,257.12,0.83,1.71,admin,"{Account Manager,Product}",3.0,Other
3783,4,2139.00,2020-05-14 10:04:10,1,,zoom,outbound,audio,,0.30,153.16,1.30,2.44,admin,{Sales},1.0,1st Call
3784,2,2151.00,2020-05-12 15:04:06,1,,zoom,outbound,audio,,0.42,120.53,0.73,2.57,admin,{Sales},2.0,1st Call


**Let's start by extracting all we can from the date column.**


In [None]:
# First I'll convert the type to datetime
df['date'] = pd.to_datetime(df['date'], 
 format = '%Y-%m-%d %H:%M:%S')
df.date.head()

0   2021-10-11 12:02:24
1   2021-10-08 18:04:23
2   2021-10-08 11:03:51
3   2021-10-07 18:10:25
4   2021-10-07 18:00:00
Name: date, dtype: datetime64[ns]

**My rationale here is that, since we're working with phone calls, the time of day and the month of the year might affect the underlying sentiment of a call. For example, some clients might be in a bad mood when they receive a call during the first or last hour of their workday, or a client might be an auditing firm whose workload is much heavier between January and March. This is purely intuition, but it also helps with our main task of dealing with the string data types.**

In [None]:
# Then let's get the important data into respective columns
df['month']= df['date'].dt.month
df['hour']= df['date'].dt.hour
df['dayOfWeek']= df['date'].dt.dayofweek
df.head()

Unnamed: 0,id,duration,date,userId,modifiedById,phoneProvider,direction,mediaType,dealId,userTalkRatio,longestContactMonologue,patience,interactionSpeed,role,teams,contacts,tag,month,hour,dayOfWeek
0,5288,657.34,2021-10-11 12:02:24,14,,zoom,outbound,video,543.0,0.29,80.3,0.77,3.01,admin,{Sales},1.0,Client Follow Up,10,12,0
1,5285,205.48,2021-10-08 18:04:23,63,,aircall,outbound,audio,,0.45,23.71,0.41,8.18,admin,{Sales},1.0,Cold Call,10,18,4
2,5282,1607.1,2021-10-08 11:03:51,18,,zoom,outbound,video,488.0,0.48,57.84,0.28,4.59,admin,{Account Manager},1.0,Client Follow Up,10,11,4
3,5277,249.91,2021-10-07 18:10:25,23,,aircall,outbound,audio,,0.45,43.19,1.77,5.52,admin,{Sales},1.0,Unscheduled Follow up,10,18,3
4,5279,1948.75,2021-10-07 18:00:00,8,,google,outbound,video,,0.32,267.82,0.47,2.03,admin,{Sales},2.0,1st Call,10,18,3


**One thing that's been on my mind since the first exploration is the null values, mainly in the 'modifiedById' and 'dealId' columns. Let's see how bad the problem is and whether we can or should do anything about it.**

In [None]:
df[['modifiedById','dealId']].isnull().sum()

modifiedById    3006
dealId          2182
dtype: int64

**The huge number of nulls in these two columns made me slightly worried about the state of the other columns in the data so I wanted to take a look at them as well.**

In [None]:
# Count missing values per column and express them as a percentage of all rows
temp = df.isnull().sum().to_frame('missing_values')
temp['percentage_of_missing_values'] = temp['missing_values'] / df.shape[0] * 100
temp = temp.round(2)[temp.percentage_of_missing_values != 0]
temp

Unnamed: 0,missing_values,percentage_of_missing_values
modifiedById,3006,79.4
dealId,2182,57.63
teams,58,1.53
contacts,1,0.03



**Okay, now that I have a good idea of what needs to be done in every column regarding the null values, let's start with the easiest issues in teams and contacts, then move on to more drastic measures for the modifiedById and dealId columns.**

In [None]:
def get_most_occurent_value(Pandascolumn):
    # value_counts() ignores NaN by default and sorts by descending frequency
    counts = df[Pandascolumn].value_counts()
    value = counts.index[0]
    numberOfOccurence = counts.iloc[0]
    print(f" The value '{value}' occured '{numberOfOccurence}' times.")
    return value

**The function above returns the most frequent value (the mode) of a given column. I'll use the modes of the contacts and teams columns to fill their null values.**
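For comparison, here is a hedged sketch of the same mode-based imputation using pandas' built-in `Series.mode()`, which skips missing values by default. The toy frame below is an assumption made for illustration and does not reproduce the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative values only)
toy = pd.DataFrame({
    "contacts": [1.0, 1.0, 2.0, np.nan],
    "teams": ["{Sales}", None, "{Sales}", "{Product}"],
})

for column in ["contacts", "teams"]:
    # mode() returns a Series of the most frequent values; take the first one
    most_frequent = toy[column].mode().iloc[0]
    # Assign the result back instead of relying on inplace=True on a column
    toy[column] = toy[column].fillna(most_frequent)

print(toy)
```

If several values tie for the highest frequency, `mode()` returns them in sorted order, so `.iloc[0]` picks the smallest one.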

In [None]:
most_frequent_contact = get_most_occurent_value('contacts')
most_frequent_team = get_most_occurent_value('teams')

 The value '1.0' occured '2761' times.
 The value '{Sales}' occured '3246' times.


In [None]:
print(most_frequent_team)

{Sales}


In [None]:
# replacing na values in contacts and teams with their respective mode
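# Assign the result back: calling fillna(inplace=True) on a selected column is
# deprecated chained assignment and may not update df in recent pandas versions
df["contacts"] = df["contacts"].fillna(most_frequent_contact)
df["teams"] = df["teams"].fillna(most_frequent_team)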

**The missing-values table below shows that the dealId column is about 58% missing and modifiedById almost 80% missing. That is more than enough to justify the harsh decision of dropping both columns in this use case.**

In [None]:
temp

Unnamed: 0,missing_values,percentage_of_missing_values
modifiedById,3006,79.4
dealId,2182,57.63
teams,58,1.53
contacts,1,0.03
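The decision to drop mostly-empty columns can also be expressed as a rule. The sketch below drops every column whose share of missing values exceeds a cut-off. The 50% value of `MISSING_THRESHOLD` is a hypothetical choice for illustration, not a threshold used in this notebook, and the toy frame is made up.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking two sparse columns and one complete column
toy = pd.DataFrame({
    "modifiedById": [np.nan, np.nan, np.nan, 7.0, np.nan],
    "dealId": [543.0, np.nan, 488.0, np.nan, np.nan],
    "contacts": [1.0, 1.0, 1.0, 1.0, 2.0],
})

MISSING_THRESHOLD = 0.5  # hypothetical cut-off: drop columns more than 50% empty

# isnull().mean() gives the fraction of missing values per column
missing_share = toy.isnull().mean()
columns_to_drop = missing_share[missing_share > MISSING_THRESHOLD].index.tolist()

toy = toy.drop(columns=columns_to_drop)
print(columns_to_drop)
```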


 **Dropping columns that won't affect the prediction**

**Let's now move on to the other columns that are counterproductive for the prediction task. The 'id' column in particular is basically just noise.**

**As for the date column, we've already extracted the most valuable information from it (i.e. hour, month, day of the week), so we can drop it as well.**

In [None]:
# Drop the identifier, the raw date, and the two columns with a huge number of missing values
df = df.drop(['id','date','modifiedById','dealId'],axis=1)
df

Unnamed: 0,duration,userId,phoneProvider,direction,mediaType,userTalkRatio,longestContactMonologue,patience,interactionSpeed,role,teams,contacts,tag,month,hour,dayOfWeek
0,657.34,14,zoom,outbound,video,0.29,80.30,0.77,3.01,admin,{Sales},1.0,Client Follow Up,10,12,0
1,205.48,63,aircall,outbound,audio,0.45,23.71,0.41,8.18,admin,{Sales},1.0,Cold Call,10,18,4
2,1607.10,18,zoom,outbound,video,0.48,57.84,0.28,4.59,admin,{Account Manager},1.0,Client Follow Up,10,11,4
3,249.91,23,aircall,outbound,audio,0.45,43.19,1.77,5.52,admin,{Sales},1.0,Unscheduled Follow up,10,18,3
4,1948.75,8,google,outbound,video,0.32,267.82,0.47,2.03,admin,{Sales},2.0,1st Call,10,18,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3781,4448.00,1,zoom,outbound,audio,0.26,198.98,1.65,2.01,admin,{Sales},1.0,Demo,5,12,0
3782,3756.00,3,zoom,outbound,audio,0.45,257.12,0.83,1.71,admin,"{Account Manager,Product}",3.0,Other,5,14,3
3783,2139.00,1,zoom,outbound,audio,0.30,153.16,1.30,2.44,admin,{Sales},1.0,1st Call,5,10,3
3784,2151.00,1,zoom,outbound,audio,0.42,120.53,0.73,2.57,admin,{Sales},2.0,1st Call,5,15,1


In [None]:
df.isnull().sum()

duration                   0
userId                     0
phoneProvider              0
direction                  0
mediaType                  0
userTalkRatio              0
longestContactMonologue    0
patience                   0
interactionSpeed           0
role                       0
teams                      0
contacts                   0
tag                        0
month                      0
hour                       0
dayOfWeek                  0
dtype: int64

**Now that our data is cleaned, we need to prepare it for our machine learning pipeline, mainly by encoding the categorical columns: phoneProvider, direction, mediaType, role, teams, and tag.**
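One possible way to do this encoding, sketched here on a toy frame, is scikit-learn's `OrdinalEncoder` for the feature columns and `LabelEncoder` for the target. This is an assumption about the approach, not necessarily the encoding used later in this notebook, and the toy values are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Toy frame with two categorical features and the target column
toy = pd.DataFrame({
    "phoneProvider": ["zoom", "aircall", "google", "zoom"],
    "mediaType": ["video", "audio", "video", "audio"],
    "tag": ["Demo", "Cold Call", "Demo", "1st Call"],
})

feature_columns = ["phoneProvider", "mediaType"]

# OrdinalEncoder maps each category to an integer code, per column,
# following the sorted order of the categories
encoder = OrdinalEncoder()
encoded_features = pd.DataFrame(
    encoder.fit_transform(toy[feature_columns]),
    columns=feature_columns,
    index=toy.index,
)

# LabelEncoder is meant for the target: one integer per class
label_encoder = LabelEncoder()
encoded_tag = label_encoder.fit_transform(toy["tag"])

print(encoded_features)
print(encoded_tag, list(label_encoder.classes_))
```

Ordinal codes imply an order between categories that does not really exist. Tree-based models usually tolerate this, while linear models may be better served by one-hot encoding.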