# MORINGA DATA SCIENCE PHASE 3 Project

Before delving into the details of the project, it is important to clarify which dataset was used. Five standard datasets were provided [here](https://drive.google.com/drive/u/4/folders/15bI18N2wQXF11C-CE3ihul_DD7axSQwK), with the alternative of selecting one's own dataset.

I chose to use the [SyriaTel Customer Churn dataset](https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset/data).

## **Preliminaries**

To begin with, all the required packages are imported up front, so that a missing dependency surfaces immediately rather than partway through the notebook.

In [12]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV

from imblearn.over_sampling import SMOTE

As the concepts of Object-Oriented Programming are now more familiar, the code required to manipulate the data we receive will be split into four classes: 
1. Data Sourcing
2. Data PreProcessing
3. Data Analysis
4. Modeling

The classes themselves, along with their respective methods, are listed below. The cells must be run in order, because each class inherits from one defined in an earlier cell.

In [13]:
class DataSourcing:
  def __init__(self,dataframe):
    self.original = dataframe.copy()  # untouched snapshot; self.dataframe is modified in place later
    self.dataframe = dataframe
  
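  def give_info(self):
    # DataFrame.info() prints its summary directly and returns None,
    # so it is called on its own instead of being interpolated into the f-string.
    self.dataframe.info()
    message =  f"""
    ----------------------------------------------------------------------->
    DESCRIPTION OF THE DATAFRAME IN QUESTION:
    ----------------------------------------------------------------------->
    
    Dataframe information => printed above by DataFrame.info()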
    ------------------------------------------------------------------------------------------------------------------------->
    
    Dataframe shape => {self.dataframe.shape[0]} rows, {self.dataframe.shape[1]} columns
    ------------------------------------------------------------------------------------------------------------------------->    
    
    There are {len(self.dataframe.columns)} columns, namely: {self.dataframe.columns}.  
    ------------------------------------------------------------------------------------------------------------------------->
        
    The first 5 records in the dataframe are seen here:
    ------------------------------------------------------------------------------------------------------------------------->
    {self.dataframe.head()}
    ------------------------------------------------------------------------------------------------------------------------->
       
    The last 5 records in the dataframe are as follows: 
    ------------------------------------------------------------------------------------------------------------------------->
    {self.dataframe.tail()}
    ------------------------------------------------------------------------------------------------------------------------->
    
    The descriptive statistics of the dataframe (mean, median, max, min, std) are as follows:
    ------------------------------------------------------------------------------------------------------------------------->
    {self.dataframe.describe()}
    ------------------------------------------------------------------------------------------------------------------------->
    """
    print(message)
  
  def null_count(self):
    return self.dataframe.isnull().sum()

In [14]:
class DataPreProcessing(DataSourcing):
  def __init__(self, dataframe):
    super().__init__(dataframe)

  def dropColumns(self, columns):
    return self.dataframe.drop(columns, axis=1)

  def dropRows(self, rows):
    return self.dataframe.drop(rows, axis=0)

  def preprocess_data(self):
    # Calculate total calls
    self.dataframe["total_calls"] = (
        self.dataframe["total day calls"] + self.dataframe["total night calls"]
        + self.dataframe["total eve calls"] + self.dataframe["total intl calls"])

    # Calculate total minutes
    self.dataframe["total_minutes"] = (
        self.dataframe["total day minutes"] + self.dataframe["total night minutes"]
        + self.dataframe["total eve minutes"] + self.dataframe["total intl minutes"])

    # Calculate total charges
    self.dataframe["total_charges"] = (
        self.dataframe["total day charge"] + self.dataframe["total night charge"]
        + self.dataframe["total eve charge"] + self.dataframe["total intl charge"])

    # Drop the per-period columns now that the totals exist
    self.dataframe.drop(
        columns=["total day calls", "total night calls", "total eve calls", "total intl calls",
                 "total day minutes", "total night minutes", "total eve minutes", "total intl minutes",
                 "total day charge", "total night charge", "total eve charge", "total intl charge"],
        inplace=True)
    return self.dataframe
    
  def one_hot_encode(self, columns):
    return pd.get_dummies(self.dataframe, columns=columns)
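To make the effect of `preprocess_data` and `one_hot_encode` concrete, here is a minimal sketch on a toy dataframe. The values are made up for illustration and only the call columns are shown. The column names follow the dataset's schema, and `toy`, `call_columns` and `encoded` are hypothetical names.

```python
import pandas as pd

# Toy records with made-up values; column names follow the dataset's schema
toy = pd.DataFrame({
    "total day calls": [100, 80],
    "total eve calls": [90, 70],
    "total night calls": [110, 60],
    "total intl calls": [3, 5],
    "international plan": ["no", "yes"],
})

call_columns = ["total day calls", "total eve calls",
                "total night calls", "total intl calls"]

# Row-wise sum of the four per-period columns, then drop the originals
toy["total_calls"] = toy[call_columns].sum(axis=1)
toy = toy.drop(columns=call_columns)

# One-hot encode the categorical column with pd.get_dummies
encoded = pd.get_dummies(toy, columns=["international plan"])
print(encoded)
```

Each row keeps a single `total_calls` value in place of the four per-period columns, and the plan column becomes one indicator column per category.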

In [15]:
class DataAnalysis(DataPreProcessing):  # DataPreProcessing already inherits from DataSourcing
    def __init__(self,dataframe):
        super().__init__(dataframe)

    def univariate_analysis(self):

        numeric_columns = self.dataframe.select_dtypes(include=['number']).columns
        categorical_columns = self.dataframe.drop(
            columns=['phone number']).select_dtypes(exclude=['number']).columns
        # Plot histograms and box plots for numeric columns
        for column in numeric_columns:

            plt.figure(figsize=(12, 6))

            # Plot histogram
            plt.subplot(1, 2, 1)
            sns.histplot(self.dataframe[column], bins=30, kde=True)
            plt.title(f'Histogram for {column}')

            # Plot box plot
            plt.subplot(1, 2, 2)
            sns.boxplot(x=self.dataframe[column])
            plt.title(f'Box Plot for {column}')

            plt.show()

        # Plot count plots for categorical columns
        for column in categorical_columns:
            plt.figure(figsize=(12, 6))

            # Plot count plot
            sns.countplot(x=self.dataframe[column])
            plt.title(f'Count Plot for {column}')

            plt.show()

    def bivariate_analysis(self, column_of_interest):

        numeric_columns = self.dataframe.select_dtypes(include=['number']).columns
        categorical_columns = self.dataframe.drop(
            columns=['phone number']).select_dtypes(exclude=['number']).columns

        for column in numeric_columns:
            # Plot histplot
            plt.figure(figsize=(12, 6))
            plt.subplot(1, 2, 1)
            sns.histplot(
                x=self.dataframe[column], hue=self.dataframe[column_of_interest], bins=30, kde=True)
            plt.title(f'Histplot for {column} by {column_of_interest}')

            # Plot box plot
            plt.subplot(1, 2, 2)
            sns.boxplot(x=self.dataframe[column_of_interest], y=self.dataframe[column])
            plt.title(f'Box Plot for {column_of_interest} by {column}')

            plt.show()

        # Plot count plots for categorical columns
        for outer in categorical_columns:

            plt.figure(figsize=(12, 6))

            # Plot count plot, split by the column of interest
            sns.countplot(x=self.dataframe[outer], hue=self.dataframe[column_of_interest])
            plt.title(f'Count Plot for {column_of_interest} vs. {outer}')

            plt.show()

    def churn_by_state(self):
        self.dataframe.groupby(["state", "churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(30,10)) 

    def churn_by_area_code(self):
        self.dataframe.groupby(["area code", "churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(5,5))

    def churn_by_int_plan(self):
        self.dataframe.groupby(["international plan", "churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(5,5)) 

    def churn_by_vm_plan(self):
        self.dataframe.groupby(["voice mail plan", "churn"]).size().unstack().plot(kind='bar', stacked=True, figsize=(5,5)) 
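The four `churn_by_*` methods share one pattern: group by a category and `churn`, count the rows, and pivot `churn` into columns before plotting. The sketch below runs that pattern on a toy dataframe with made-up values and omits the plotting step. The `fill_value=0` argument, the `stayed`/`churned` labels and the churn-rate line are additions for illustration and are not part of the class.

```python
import pandas as pd

# Toy data with made-up values
toy = pd.DataFrame({
    "international plan": ["no", "no", "no", "yes", "yes"],
    "churn": [False, False, True, True, True],
})

# Count rows per (plan, churn) pair, then pivot churn into columns.
# fill_value=0 replaces combinations that never occur (here: yes/False).
counts = toy.groupby(["international plan", "churn"]).size().unstack(fill_value=0)

# Columns come out sorted as False, True; give them readable labels
counts.columns = ["stayed", "churned"]
print(counts)

# Share of churned customers within each group
churn_rate = counts["churned"] / counts.sum(axis=1)
print(churn_rate)
```

Calling `.plot(kind='bar', stacked=True)` on `counts` would draw one stacked bar per plan value, which is what the class methods do for each category.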

In [16]:
# NOTE - Remember to consider over/under sampling when training the model

class Modeling(DataPreProcessing):
    def __init__(self,dataframe):
        super().__init__(dataframe)

    def feature_scaling(self):
        # Standardise each listed column to zero mean and unit variance.
        # StandardScaler works column-wise, so one call matches scaling each column separately.
        columns_to_scale = [
            "total day minutes", "total day calls", "total day charge",
            "total eve minutes", "total eve calls", "total eve charge",
            "total night minutes", "total night calls",
        ]
        scaler = StandardScaler()
        self.dataframe[columns_to_scale] = scaler.fit_transform(
            self.dataframe[columns_to_scale])
        
    # def label_encoder(self):
    #     le = LabelEncoder()
    #     self.dataframe["churn"] = le.fit_transform(self.dataframe["churn"])
    #     self.dataframe['state'] = le.fit_transform(self.dataframe['state'])
    #     self.dataframe['area code'] = le.fit_transform(self.dataframe['area code'])
    #     self.dataframe['international plan'] = le.fit_transform(
    #         self.dataframe['international plan'])
    #     self.dataframe['voice mail plan'] = le.fit_transform(
    #         self.dataframe['voice mail plan'])
    #     return self.dataframe

    def smote(self):
        X = self.dataframe.drop(columns=["churn", "phone number"], axis=1)
        y = self.dataframe["churn"]
        smote = SMOTE()
        X_resampled, y_resampled = smote.fit_resample(X, y)
        return X_resampled, y_resampled

    def train_test_split(self, target_class, test_size=0.2, random_state=42):
        X = self.dataframe.drop(columns=["churn", "phone number"], axis=1)
        y = self.dataframe["churn"]
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state, stratify=y)
        return X_train, X_test, y_train, y_test

    def logistic(self, X_train, X_test, y_train, y_test):
        model = LogisticRegression()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        return model.score(X_test, y_test)

    def decision_tree(self, X_train, X_test, y_train, y_test):
        model = DecisionTreeClassifier()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        return model.score(X_test, y_test)

    def random_forest(self, X_train, X_test, y_train, y_test):
        model = RandomForestClassifier()
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        return  model.score(X_test, y_test)

    def hyperparameter_tuning(self, X_train, X_test, y_train, y_test):
        model = RandomForestClassifier()
        param_grid = {
            'n_estimators': [100, 200, 300],
            'max_depth': [None, 5, 10],
            'min_samples_split': [2, 5, 10],
            'min_samples_leaf': [1, 2, 4]
        }
        grid_search = GridSearchCV(model, param_grid, cv=5)
        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_
        y_pred = best_model.predict(X_test)
        score3 = best_model.score(X_test, y_test)
        return score3


# model_0 = Modeling()
# model_0.label_encoder(data_copy)
# model_0.smote(data_copy)
# X_train, X_test, y_train, y_test = model_0.train_test_split(
#     data_copy, target_class=("X_train", "y_train"), test_size=0.2, random_state=42)
# score = model_0.logistic(X_train, X_test, y_train, y_test)
# print(score)
# score1 = model_0.decision_tree(X_train, X_test, y_train, y_test)
# print(score1)
# score2 = model_0.random_forest(X_train, X_test, y_train, y_test)
# print(score2)
# score3 = model_0.hyperparameter_tuning(X_train, X_test, y_train, y_test)
# print(score3)
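The note at the top of the `Modeling` cell mentions over/under sampling. One common way to handle it is to split first and resample only the training portion, so the test set keeps the original class balance. The sketch below illustrates that order on synthetic data. It uses plain random oversampling via `sklearn.utils.resample` as a simple stand-in for SMOTE, and the names `df`, `train` and `balanced` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic, imbalanced data: 90 non-churners and 10 churners
df = pd.DataFrame({"feature": range(100),
                   "churn": [False] * 90 + [True] * 10})

X = df[["feature"]]
y = df["churn"]

# Split first, keeping the class ratio in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Oversample the minority class in the training data only
train = X_train.assign(churn=y_train)
majority = train[~train["churn"]]
minority = train[train["churn"]]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_upsampled])

print(balanced["churn"].value_counts())  # classes are now equal in size
print(y_test.value_counts())             # test set is left untouched
```

With SMOTE the same order would apply: call `fit_resample` on `X_train` and `y_train` rather than on the full dataframe.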

## **1. DATA UNDERSTANDING**

The identification, gathering, and cursory analysis of the data in this part will be carried out by:

- gathering preliminary data, which has been put into a CSV file.
- describing the data that we have at our disposal.
- looking for patterns and correlations in the data.
- confirming the accuracy of the data.

First, the dataset is loaded directly with pandas rather than through one of the classes. The classes take a dataframe as their constructor argument, and pandas already provides readers for many kinds of data files.

In [17]:
df_raw = pd.read_csv('./data/telecom.csv')
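As a small, self-contained illustration of the loading step, the sketch below reads CSV text from memory instead of from disk. The two records are made up and only three of the dataset's columns are included. `csv_text` and `df_example` are hypothetical names.

```python
import io

import pandas as pd

# Made-up CSV content using three of the dataset's column names
csv_text = """state,area code,churn
KS,415,False
OH,408,True
"""

# read_csv accepts any file-like object, not only a path
df_example = pd.read_csv(io.StringIO(csv_text))

print(df_example.shape)
print(df_example.dtypes)
```

The resulting dataframe can be passed to any of the classes in exactly the same way as one loaded from a file.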

Now that we have a dataframe of our dataset, we can pass that dataframe to our `DataSourcing` class and begin the process of data understanding.

In [18]:
data_sourcing = DataSourcing(dataframe=df_raw)
data_sourcing.give_info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3333 entries, 0 to 3332
Data columns (total 21 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   state                   3333 non-null   object 
 1   account length          3333 non-null   int64  
 2   area code               3333 non-null   int64  
 3   phone number            3333 non-null   object 
 4   international plan      3333 non-null   object 
 5   voice mail plan         3333 non-null   object 
 6   number vmail messages   3333 non-null   int64  
 7   total day minutes       3333 non-null   float64
 8   total day calls         3333 non-null   int64  
 9   total day charge        3333 non-null   float64
 10  total eve minutes       3333 non-null   float64
 11  total eve calls         3333 non-null   int64  
 12  total eve charge        3333 non-null   float64
 13  total night minutes     3333 non-null   float64
 14  total night calls       3333 non-null   

In [19]:
data_sourcing.null_count()

state                     0
account length            0
area code                 0
phone number              0
international plan        0
voice mail plan           0
number vmail messages     0
total day minutes         0
total day calls           0
total day charge          0
total eve minutes         0
total eve calls           0
total eve charge          0
total night minutes       0
total night calls         0
total night charge        0
total intl minutes        0
total intl calls          0
total intl charge         0
customer service calls    0
churn                     0
dtype: int64
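Every count above is zero, so the dataset has no missing values. For contrast, here is a minimal sketch of what the same `isnull().sum()` call reports when values are missing, using a toy dataframe with made-up entries (`toy` and `null_counts` are hypothetical names).

```python
import numpy as np
import pandas as pd

# Toy data with deliberately missing entries
toy = pd.DataFrame({
    "state": ["KS", None, "OH"],
    "total day minutes": [265.1, np.nan, 243.4],
    "churn": [False, True, False],
})

# isnull() marks each missing cell; sum() counts them per column
null_counts = toy.isnull().sum()
print(null_counts)
```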