# MORINGA DATA SCIENCE PHASE 3 Project

Before going into the details of the project, it is worth clarifying which dataset was used. Five standard datasets were provided [here](https://drive.google.com/drive/u/4/folders/15bI18N2wQXF11C-CE3ihul_DD7axSQwK), with the alternative of selecting one's own dataset.

I chose to use the [SyriaTel Customer Churn dataset](https://www.kaggle.com/datasets/becksddf/churn-in-telecoms-dataset/data).

## **Preliminaries**

All the relevant packages are imported at the very beginning, so that a missing package is caught before any analysis code runs.

In [2]:
#important libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression

%matplotlib inline

In [43]:
class DataSourcing:
  def __init__(self):
    pass
  
  def open_csv(self, path):
    df = pd.read_csv(path)
    return df
  
  def give_info(self, df):
    message =  f"""
    <------------------------------------------------------------------------DESCRIPTION OF THE DATAFRAME IN QUESTION!!!----------------------------------------------------------------------->
    There are {df.shape[0]} rows and {df.shape[1]} columns in the dataframe.
    There are {len(df.columns)} columns, namely: {df.columns}.  
    ------------------------------------------------------------------------------------------------------------------------->
    
    
    The first 5 records in the df are seen here:
    ------------------------------------------------------------------------------------------------------------------------->
    {df.head()}
    ------------------------------------------------------------------------------------------------------------------------->
    
    
    The last 5 records in the df are as follows: 
    ------------------------------------------------------------------------------------------------------------------------->
    {df.tail()}
    ------------------------------------------------------------------------------------------------------------------------->
    
    
    The descriptive statistics of the df (count, mean, std, min, quartiles, max) are as follows:
    ------------------------------------------------------------------------------------------------------------------------->
    {df.describe()}
    ------------------------------------------------------------------------------------------------------------------------->
    """
    print(message)
  

In [35]:
class DataPreProcessing(DataSourcing):
  def __init__(self,dataframe):
    self.dataframe = dataframe
    
  def dropColumns(self, df, columns):
    df = df.drop(columns, axis=1)
    return df

  def dropRows(self, df, rows):
    df = df.drop(rows, axis=0)
    return df
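
The two methods above are thin wrappers around `DataFrame.drop`. As a minimal sketch of what they do, the snippet below calls `drop` directly on a small made-up frame (the column names are borrowed from the dataset, but `toy` and its values are purely illustrative). It also shows that `drop` returns a new frame and leaves the original untouched, which is why the methods return `df`.

```python
import pandas as pd

# Toy frame with a few of the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "state": ["KS", "OH", "NJ"],
    "phone number": ["000-0001", "000-0002", "000-0003"],
    "total day minutes": [100.0, 150.0, 200.0],
})

# Dropping columns: equivalent to toy.drop(["phone number"], axis=1).
without_phone = toy.drop(columns=["phone number"])

# Dropping rows by index label: equivalent to toy.drop([0], axis=0).
without_first = toy.drop(index=[0])

print(list(without_phone.columns))
print(list(without_first.index))
print(toy.shape)  # the original frame keeps all rows and columns
```

Because the result is a new object, the caller has to keep the returned frame, for example `df = processor.dropColumns(df, ["phone number"])`, where `processor` stands for a hypothetical `DataPreProcessing` instance.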

In [44]:
data = DataSourcing()
df = data.open_csv("data/telecom.csv")
data.give_info(df)


    <------------------------------------------------------------------------DESCRIPTION OF THE DATAFRAME IN QUESTION!!!----------------------------------------------------------------------->
    There are 3333 rows and 21 columns in the dataframe.
    There are 21 columns, namely: Index(['state', 'account length', 'area code', 'phone number',
       'international plan', 'voice mail plan', 'number vmail messages',
       'total day minutes', 'total day calls', 'total day charge',
       'total eve minutes', 'total eve calls', 'total eve charge',
       'total night minutes', 'total night calls', 'total night charge',
       'total intl minutes', 'total intl calls', 'total intl charge',
       'customer service calls', 'churn'],
      dtype='object').
    The information of the dataframe is as follows, with the datatype of each column being specified: 
    ------------------------------------------------------------------------------------------------------------------------->
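
The output header above mentions the datatype of each column, which is what `DataFrame.info()` reports. One thing to keep in mind is that `info()` writes to standard output and returns `None`, so placing `{df.info()}` inside an f-string would print the summary first and then embed the text `None`. The sketch below shows one way around this, assuming a hypothetical helper named `info_as_text` and a small made-up frame; it is not necessarily how the notebook produced its output.

```python
import io

import pandas as pd

# Made-up frame standing in for the real dataset.
toy = pd.DataFrame({"state": ["KS", "OH"], "account length": [128, 107]})


def info_as_text(df):
    """Return the text that df.info() would normally print (hypothetical helper)."""
    buffer = io.StringIO()
    df.info(buf=buffer)  # info() accepts a writable buffer through `buf`
    return buffer.getvalue()


info_text = info_as_text(toy)
print(f"The information of the dataframe is as follows:\n{info_text}")
```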
  
   

In [37]:
df = pd.read_csv("data/telecom.csv")
df

| | state | account length | area code | phone number | international plan | voice mail plan | number vmail messages | total day minutes | total day calls | total day charge | ... | total eve calls | total eve charge | total night minutes | total night calls | total night charge | total intl minutes | total intl calls | total intl charge | customer service calls | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | KS | 128 | 415 | 382-4657 | no | yes | 25 | 265.1 | 110 | 45.07 | ... | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
| 1 | OH | 107 | 415 | 371-7191 | no | yes | 26 | 161.6 | 123 | 27.47 | ... | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
| 2 | NJ | 137 | 415 | 358-1921 | no | no | 0 | 243.4 | 114 | 41.38 | ... | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | OH | 84 | 408 | 375-9999 | yes | no | 0 | 299.4 | 71 | 50.90 | ... | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | OK | 75 | 415 | 330-6626 | yes | no | 0 | 166.7 | 113 | 28.34 | ... | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3328 | AZ | 192 | 415 | 414-4276 | no | yes | 36 | 156.2 | 77 | 26.55 | ... | 126 | 18.32 | 279.1 | 83 | 12.56 | 9.9 | 6 | 2.67 | 2 | False |
| 3329 | WV | 68 | 415 | 370-3271 | no | no | 0 | 231.1 | 57 | 39.29 | ... | 55 | 13.04 | 191.3 | 123 | 8.61 | 9.6 | 4 | 2.59 | 3 | False |
| 3330 | RI | 28 | 510 | 328-8230 | no | no | 0 | 180.8 | 109 | 30.74 | ... | 58 | 24.55 | 191.9 | 91 | 8.64 | 14.1 | 6 | 3.81 | 2 | False |
| 3331 | CT | 184 | 510 | 364-6381 | yes | no | 0 | 213.8 | 105 | 36.35 | ... | 84 | 13.57 | 139.2 | 137 | 6.26 | 5.0 | 10 | 1.35 | 2 | False |


In [38]:
class Customer:
  def __init__(self, df, id):
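    # `id` is the positional row index of the customer (used with iloc), not a phone number.
    # Comparing the 'phone number' column against it would match no rows and give an empty Series.
    self.state = df.iloc[id]['state']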
    self.account_length = df.iloc[id]['account length']
    self.area_code = df.iloc[id]['area code']
    self.phone_number = df.iloc[id]['phone number']
    # self.international_plan = df.iloc[id]['international plan']
    # self.vmail_plan = df.iloc[id]['voice mail plan']
    # self.number_vmail_messages = df.iloc[id]['number vmail messages']
    # self.total_day_minutes = df.iloc[id]['total day minutes']
    # self.total_day_calls = df.iloc[id]['total day calls']
    # self.total_day_charge = df.iloc[id]['total day charge']
    # self.total_eve_minutes = df.iloc[id]['total eve minutes']
    # self.total_eve_calls = df.iloc[id]['total eve calls']
    # self.total_eve_charge = df.iloc[id]['total eve charge']
    # self.total_night_minutes = df.iloc[id]['total night minutes']
    # self.total_night_calls = df.iloc[id]['total night calls']
    # self.total_night_charge = df.iloc[id]['total night charge']
    # self.total_intl_minutes = df.iloc[id]['total intl minutes']
    # self.total_intl_calls = df.iloc[id]['total intl calls']
    # self.total_intl_charge = df.iloc[id]['total intl charge']
    # self.customer_service_calls = df.iloc[id]['customer service calls']

In [39]:
James = Customer(df,0)

In [40]:
James.phone_number

'382-4657'
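
The `Customer` constructor above reads fields by row position with `df.iloc[id]`, while the `phone_number` attribute shows that every row also carries a phone number that could serve as a lookup key. As a hedged sketch of the difference between the two lookups, the snippet below uses a made-up frame and a hypothetical helper named `find_by_phone`; neither is part of the original notebook.

```python
import pandas as pd

# Made-up frame; the phone numbers are placeholders, not values from the dataset.
toy = pd.DataFrame({
    "state": ["KS", "OH"],
    "phone number": ["000-0001", "000-0002"],
})


def find_by_phone(df, phone_number):
    """Return the first row whose phone number matches, or None (hypothetical helper)."""
    matches = df[df["phone number"] == phone_number]
    if matches.empty:
        return None
    return matches.iloc[0]


# Positional lookup: the row at position 0, whatever its phone number is.
by_position = toy.iloc[0]

# Key-based lookup: the row whose 'phone number' equals the given string.
by_phone = find_by_phone(toy, "000-0002")

print(by_position["state"])
print(by_phone["state"])
print(find_by_phone(toy, "999-9999"))
```

A boolean filter returns a frame that may hold zero rows, so the helper checks `matches.empty` before indexing. Positional lookup with `iloc` never has that ambiguity, but it raises `IndexError` when the position is out of range.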