<h1>Welcome to Data Science Fundamentals: Pandas to Predictions</h1>

<p align="center">
  <img src="Datafain logo.jpg"/>
</p>

Hello. My name is [Ridwan Salahuddeen](https://www.linkedin.com/in/ridwan-salahuddeen/) and I will be taking you through this class.

Follow [DataFain](https://www.linkedin.com/company/data-for-all-initiative) on LinkedIn

## Table of Contents
<font >
<ol>
<li> Ice Breaker: What is Data Science?, Data Science Tools, NumPy Introduction, Quiz 1
<li> Importance of NumPy, More Hands-on Use of NumPy, Quiz 2, Assignment 1
<li> Introduction to Pandas: Creating DataFrames and Reading From Files, Quiz 3
<li> More on Pandas: Manipulating DataFrames, Quiz 4, Assignment 2
<li> Introduction to Matplotlib: Plotting Simple Charts, Quiz 5
<li> More on Matplotlib: Drawing Inference from Plots, Quiz 6, Assignment 3
<li> Types of Machine Learning Algorithms, Quiz 7
<li> Understanding Learning Algorithms - Logistic Regression, Quiz 8, Assignment 4
<li> Using scikit-learn: Logistic Regression, Sample Classification Problem 1, Quiz 9
<li> Sample Classification Problem 2, Quiz 10, Assignment 5
</ol>
</font>

# Final Project

**Problem Statement**
---

A telecom firm has collected data on all its customers. As the resident data scientist, you have been asked to predict which customers are likely to churn.

A customer is said to have churned when they stop using the services of the company.

**Description of Dataset**
---

The main types of attributes are:

1. Demographics (age, gender, etc.)
2. Services availed (internet packs purchased, special offers, etc.)
3. Expenses (amount of recharge done per month, etc.)

Based on all this past information, you want to build a model that predicts whether a particular customer will churn.

The variable of interest, i.e. the target variable, is `Churn`, which tells us whether a particular customer has churned. It is a binary variable: 1 means the customer has churned and 0 means the customer has not. In the raw files it is stored as Yes/No and is encoded as 1/0 before modelling.
The merged dataset has 21 columns, including the `customerID` identifier and the `Churn` target. The remaining columns are the predictors used to predict whether a particular customer will switch to another telecom provider.
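
As a small illustration of the 1/0 convention described above, the sketch below maps a Yes/No column to a binary target. The toy values are made up and `churn_raw` / `churn_binary` are hypothetical names. The explicit dictionary is one possible approach, chosen here because it makes the meaning of each code visible.

```python
import pandas as pd

# Hypothetical toy column standing in for the real 'Churn' column
churn_raw = pd.Series(["No", "No", "Yes", "No", "Yes"], name="Churn")

# An explicit mapping keeps the meaning of each code unambiguous
churn_binary = churn_raw.map({"No": 0, "Yes": 1})

print(churn_binary.tolist())  # [0, 0, 1, 0, 1]
print(churn_binary.mean())    # share of churned customers in the toy data: 0.4
```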

**DATA**
---
Most of the columns have intuitive names whose meaning can easily be figured out.

**PROPOSED PROCEDURE**
---
* Import the required libraries
* Import all datasets
* Merge all datasets on the shared key (`customerID`)
* Data cleaning: check for null values
* Check for missing values and replace them
* Model building
    * Binary encoding
    * One-hot encoding
    * Creating dummy variables and removing the extra columns
* Feature selection using RFE (Recursive Feature Elimination)
* Get the predicted values on the train set
* Create a new column `predicted` with 1 if the predicted churn probability > 0.5, else 0
* Create a confusion matrix on the train and test sets
* Check the overall accuracy
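
The feature-selection and thresholding steps in the list above can be sketched roughly as follows. This is a minimal illustration on synthetic data from `make_classification`, not the course solution. The number of features kept (5), the `max_iter` setting, and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the churn data (assumption: 10 features, 4 informative)
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=4, random_state=0)

# Recursive Feature Elimination: refit the model and drop the weakest
# feature each round until only 5 remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X, y)
X_sel = X[:, selector.support_]

# Fit on the selected features and get the predicted churn probabilities
model = LogisticRegression(max_iter=1000).fit(X_sel, y)
churn_prob = model.predict_proba(X_sel)[:, 1]

# 1 if the predicted probability is above 0.5, else 0
predicted = (churn_prob > 0.5).astype(int)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y, predicted)
print(selector.support_)
print(cm)
```

On the real data the same calls would be applied to the encoded training set, with the test set held back for the final check.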

**Success Metric**
---
In this exercise, the success metric that will be employed is the **F1 Score**. Your goal is to achieve at least a **65% score**.
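
To see why F1 rather than accuracy is a sensible target when churners are the minority, here is a small sketch with made-up labels. The two "models" are hypothetical prediction lists, not outputs of anything trained in this notebook.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels (made up): 8 customers stay (0), 2 churn (1)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A "model" that never predicts churn
always_stay = [0] * 10
# A "model" that catches one churner and raises one false alarm
some_skill = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

acc_stay = accuracy_score(y_true, always_stay)
f1_stay = f1_score(y_true, always_stay, zero_division=0)
acc_skill = accuracy_score(y_true, some_skill)
f1_skill = f1_score(y_true, some_skill)

print(acc_stay, f1_stay)    # 0.8 0.0
print(acc_skill, f1_skill)  # 0.8 0.5
```

Both prediction lists have the same accuracy, but only the F1 score shows that the first one never finds a churner.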

The dataset was obtained from [Kaggle](https://www.kaggle.com/dileep070/logisticregression-telecomcustomer-churmprediction).

### Importing Libraries

**TO DO 1**: Complete the code below to import the necessary libraries and modules. Do not worry if you cannot recall all the needed libraries and modules at the start. You can always come back to update it when you do remember.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

### Read in the data as dataframes

There are three datasets which you can download from 

In [18]:
cust = pd.read_csv('customer_data.csv')
cust.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No


In [19]:
churn = pd.read_csv('churn_data.csv')
churn.head()

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,1,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,34,Yes,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,2,Yes,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,45,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,2,Yes,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [20]:
internet = pd.read_csv('internet_data.csv')
internet.head()

Unnamed: 0,customerID,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,No phone service,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,No,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,No,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,No phone service,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,No,Fiber optic,No,No,No,No,No,No


In [17]:
cust.shape
# churn.shape
# internet.shape

(7042, 5)

In [25]:
# Since customerID is common to all the dataframes, we will join them using the 'on' argument

# Start by merging cust and churn dataframes
merge = pd.merge(cust, churn, on = 'customerID')

# Merge 'internet' dataframe with 'merge' dataframe
merge = pd.merge(merge, internet, on = 'customerID')
merge.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,Contract,PaperlessBilling,PaymentMethod,...,TotalCharges,Churn,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,Female,0,Yes,No,1,No,Month-to-month,Yes,Electronic check,...,29.85,No,No phone service,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,Male,0,No,No,34,Yes,One year,No,Mailed check,...,1889.5,No,No,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,Male,0,No,No,2,Yes,Month-to-month,Yes,Mailed check,...,108.15,Yes,No,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,Male,0,No,No,45,No,One year,No,Bank transfer (automatic),...,1840.75,No,No phone service,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,Female,0,No,No,2,Yes,Month-to-month,Yes,Electronic check,...,151.65,Yes,No,Fiber optic,No,No,No,No,No,No
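
`pd.merge` performs an inner join by default, so a customer missing from either file is dropped without any message. The sketch below, on made-up toy frames, shows two optional checks: `validate` raises an error if the key is not unique on both sides, and `indicator=True` on an outer join reports which rows matched. The frame names and values are hypothetical.

```python
import pandas as pd

# Toy frames (made up); 'C' and 'D' each appear on one side only
left = pd.DataFrame({"customerID": ["A", "B", "C"],
                     "gender": ["Female", "Male", "Male"]})
right = pd.DataFrame({"customerID": ["A", "B", "D"],
                      "tenure": [1, 34, 2]})

# Default inner join keeps only keys present in both frames;
# validate="one_to_one" raises MergeError on duplicate keys
inner = pd.merge(left, right, on="customerID", validate="one_to_one")
print(inner.shape)  # (2, 3)

# Outer join with an indicator column shows where each row came from
outer = pd.merge(left, right, on="customerID", how="outer", indicator=True)
print(outer[["customerID", "_merge"]])
```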


In [26]:
merge.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
dtype: int64
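
A count of zero from `isna()` does not rule out missing data stored as text. `TotalCharges` is selected as an object (text) column in the encoding step, which suggests it holds at least one non-numeric entry. One possible cause, assumed here, is blank strings. The sketch below uses a made-up toy column to show how `pd.to_numeric` with `errors="coerce"` exposes such entries so they can be replaced. The median fill is only one option.

```python
import pandas as pd

# Toy column (made up) with one blank entry stored as text
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5"]})

# The blank string is not NaN, so isna() reports nothing missing
print(toy["TotalCharges"].isna().sum())  # 0

# Coerce to numbers; anything unparseable becomes NaN
numeric = pd.to_numeric(toy["TotalCharges"], errors="coerce")
print(numeric.isna().sum())  # 1

# Replace the missing value, here with the column median
filled = numeric.fillna(numeric.median())
print(filled.tolist())
```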

### Encode Categorical Variables

In [45]:
num = merge.select_dtypes(exclude=object)

In [46]:
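# Keep only the text (object) columns; these need encoding
cat = merge.select_dtypes(object)
# Distinct values of one column, in order of first appearance
aa = cat['Partner'].unique()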

In [47]:
dict(zip(aa, range(len(aa))))

{'Yes': 0, 'No': 1}

In [48]:
# Work on an explicit copy so the assignments below do not raise SettingWithCopyWarning
cat = cat.copy()
for col in cat:
    # Map each distinct value to an integer code, in order of first appearance
    vals = dict(zip(cat[col].unique(), range(len(cat[col].unique()))))
    cat[col] = cat[col].map(vals)


In [49]:
cat.head()

Unnamed: 0,customerID,gender,Partner,Dependents,PhoneService,Contract,PaperlessBilling,PaymentMethod,TotalCharges,Churn,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,0,0,0
2,2,1,1,0,1,0,0,1,2,1,1,0,1,0,0,0,0,0
3,3,1,1,0,0,1,1,2,3,0,0,0,1,1,1,1,0,0
4,4,0,1,0,1,0,0,0,4,1,1,1,0,1,0,0,0,0
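
The integer codes above also cover `customerID`, which is a row identifier rather than a customer attribute, and they give multi-category columns such as `Contract` an arbitrary numeric order. As a hedged alternative, the sketch below drops the identifier and one-hot encodes the remaining text columns with `pd.get_dummies`. The toy frame reuses a few values from the tables above. `drop_first=True` is an assumption about how the "extra columns" would be removed.

```python
import pandas as pd

# Small toy frame built from values shown in the tables above
toy = pd.DataFrame({
    "customerID": ["7590-VHVEG", "5575-GNVDE", "3668-QPYBK"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "PaymentMethod": ["Electronic check", "Mailed check", "Mailed check"],
})

# The identifier carries no predictive meaning, so leave it out
features = toy.drop(columns="customerID")

# One 0/1 column per category; drop_first removes the redundant one
encoded = pd.get_dummies(features, drop_first=True, dtype=int)
print(encoded)
```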


In [52]:
data_comb = pd.concat([cat, num], axis=1)

In [64]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.preprocessing import MinMaxScaler

In [68]:
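# Weight churners (class 1) four times as heavily as non-churners (class 0)
model = LogisticRegression(class_weight={0:2, 1:8})
# Scale every feature to the [0, 1] range
ss = MinMaxScaler()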

In [69]:
train, test = train_test_split(data_comb)

train_x, train_y = train.drop('Churn', axis = 1), train.Churn
test_x, test_y = test.drop('Churn', axis = 1), test.Churn
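
The split above uses the defaults, so it changes on every run and the share of churners can differ between the train and test sets. A minimal sketch of a reproducible, stratified split is shown below on made-up toy data. The `test_size` and `random_state` values are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 rows, 25% positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 75 + [1] * 25)

# stratify=y keeps the class proportions equal in both parts;
# random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```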

In [70]:
train_x = ss.fit_transform(train_x)
model.fit(train_x, train_y)

LogisticRegression(class_weight={0: 2, 1: 8})

In [71]:
test_x = ss.transform(test_x)

pred = model.predict(test_x)

accuracy_score(test_y, pred)

0.7081203861442362

In [72]:
f1_score(test_y, pred)

0.6175595238095238

In [73]:
confusion_matrix(test_y, pred)

array([[832, 456],
       [ 58, 415]], dtype=int64)
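
The confusion matrix above can be unpacked by hand to see where the F1 score comes from. The sketch below uses the four counts reported in the output and scikit-learn's layout (rows are actual classes, columns are predicted classes).

```python
import numpy as np

# Counts copied from the confusion matrix printed above
cm = np.array([[832, 456],
               [58, 415]])

# Layout: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of predicted churners, how many really churned
recall = tp / (tp + fn)      # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(round(precision, 4), round(recall, 4))  # 0.4765 0.8774
print(round(f1, 4), round(accuracy, 4))       # 0.6176 0.7081
```

Recall is high but precision is below 0.5, so the false positives are what hold the F1 score under the 0.65 target.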

In [67]:
data_comb.Churn.value_counts()

0    5173
1    1869
Name: Churn, dtype: int64
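
These class counts give two useful reference points, sketched below. The label array is rebuilt from the counts above, not from the actual data. The first is the accuracy of always predicting "no churn". The second is the set of weights that scikit-learn's `"balanced"` option would assign, shown as a point of comparison with the hand-picked `{0: 2, 1: 8}`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Rebuild a label array from the counts reported above
y = np.array([0] * 5173 + [1] * 1869)

# Accuracy of a model that always predicts the majority class (0)
baseline_acc = 5173 / len(y)
print(round(baseline_acc, 4))  # 0.7346

# "balanced" weights: n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y)
print(weights.round(4))  # [0.6806 1.8839]
```

On the full dataset, always predicting 0 would score about 73% accuracy while finding no churners at all, which is another reason to judge the model by F1.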