# Data Preparation
In the first phase of the Coursera project, I explored the data to survey each column: its data type, its distribution, and the extent of its possible contribution to predicting user retention. In this second phase, I prepare the data for modeling. At this point, the data has no null values or duplicates. In this phase I dummy-encode the categorical variables and remove the columns that did not prove predictive of the target variable.

## Import Required Packages

In [5]:
# Standard Python packages
from math import sqrt
import pickle

# Data packages
import pandas as pd
import numpy as np

# Visualization Packages
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

# Statistical inference package
from scipy import stats
from scipy.stats import t, chi2, norm, f

# Machine Learning / Classification packages
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report, accuracy_score, precision_score,
    recall_score, f1_score, confusion_matrix, ConfusionMatrixDisplay, roc_auc_score,
)


In [3]:
# Load the pickled data from the first phase
with open('Data//coursera data.pickle', 'rb') as file:
    df = pickle.load(file)
df.shape

(413953, 38)
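
The introduction states that the data has no null values or duplicates. A quick sanity check along the following lines could confirm that before going further. This is only a sketch on a small hypothetical frame (`toy` and its values are made up for illustration); on the real data the same two expressions would be applied to `df`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real `df` (values are invented)
toy = pd.DataFrame({
    "subscription_id": ["a1", "b2", "c3"],
    "is_retained": [1, 0, 1],
    "learner_other_revenue": [0.0, 49.0, 12.5],
})

# Total number of missing cells across all columns
n_nulls = int(toy.isna().sum().sum())

# Number of rows that exactly repeat an earlier row
n_duplicates = int(toy.duplicated().sum())

print(f"nulls: {n_nulls}, duplicate rows: {n_duplicates}")
```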

In [4]:
df.columns

Index(['subscription_id', 'observation_dt', 'is_retained', 'specialization_id',
       'cnt_courses_in_specialization', 'specialization_domain',
       'is_professional_certificate', 'is_gateway_certificate',
       'learner_days_since_registration', 'learner_country_group',
       'learner_gender', 'learner_cnt_other_courses_active',
       'learner_cnt_other_courses_paid_active',
       'learner_cnt_other_courses_items_completed',
       'learner_cnt_other_courses_paid_items_completed',
       'learner_cnt_other_transactions_past', 'learner_other_revenue',
       'subscription_period_order', 'days_since_last_payment',
       'days_til_next_payment_due',
       'cnt_enrollments_started_before_payment_period',
       'cnt_enrollments_completed_before_payment_period',
       'cnt_enrollments_active_before_payment_period',
       'cnt_items_completed_before_payment_period',
       'cnt_graded_items_completed_before_payment_period',
       'is_subscription_started_with_free_trial',
      

## Dummy Encode Categorical Variables

In [13]:
del dummy

In [14]:
# objects = df.select_dtypes(include='category').columns
df1 = pd.get_dummies(df, columns = ['learner_country_group'], drop_first=True, prefix="", prefix_sep="")
df1.columns

Index(['subscription_id', 'observation_dt', 'is_retained', 'specialization_id',
       'cnt_courses_in_specialization', 'specialization_domain',
       'is_professional_certificate', 'is_gateway_certificate',
       'learner_days_since_registration', 'learner_gender',
       'learner_cnt_other_courses_active',
       'learner_cnt_other_courses_paid_active',
       'learner_cnt_other_courses_items_completed',
       'learner_cnt_other_courses_paid_items_completed',
       'learner_cnt_other_transactions_past', 'learner_other_revenue',
       'subscription_period_order', 'days_since_last_payment',
       'days_til_next_payment_due',
       'cnt_enrollments_started_before_payment_period',
       'cnt_enrollments_completed_before_payment_period',
       'cnt_enrollments_active_before_payment_period',
       'cnt_items_completed_before_payment_period',
       'cnt_graded_items_completed_before_payment_period',
       'is_subscription_started_with_free_trial',
       'cnt_enrollments_started
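
To make the effect of the `pd.get_dummies` call above easier to see, here is a minimal sketch on a toy frame. The country-group labels are hypothetical placeholders, not the actual categories in the data. With `drop_first=True` the first category in sorted order is dropped and acts as the baseline, and with `prefix=""` and `prefix_sep=""` the new columns are named after the raw category labels.

```python
import pandas as pd

# Toy frame; the country-group labels are invented for illustration
toy = pd.DataFrame({
    "is_retained": [1, 0, 1, 0],
    "learner_country_group": ["East Asia", "Northern Europe", "India", "East Asia"],
})

encoded = pd.get_dummies(
    toy,
    columns=["learner_country_group"],
    drop_first=True,   # drop the first category to avoid redundant columns
    prefix="",         # no prefix: column names are the category labels
    prefix_sep="",
)

print(list(encoded.columns))
```

One design consequence of the empty prefix: if a category label ever matched an existing column name, the result would contain duplicate column names, so a non-empty prefix may be the safer choice in other datasets.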

## Feature Engineering
### Feature Selection
In this step I will remove the following columns:
- **`subscription_id`**: This column is not predictive.
- **`observation_dt`**: This column is not predictive.
- **`specialization_id`, `specialization_domain`**: In the previous phase, I created a new feature, `Popularity`, that measures the popularity of a specialization among learners, so these two columns are already engineered into a numerical value.
- **`learner_days_since_registration`**: I remove this column for two reasons:
     - The observation date is selected randomly, so this column is only a random number.
     - A t-test of means showed no statistically significant difference in the mean of this column between churned and retained learners.
- **`learner_gender`**: A z-test of proportions showed no statistically significant difference in the gender proportions between churned and retained learners.
- **`learner_country_group`**: Was encoded in the previous section.
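
The two tests mentioned in the list were run in the previous phase and are not shown here. The sketch below illustrates what such tests could look like, using synthetic numbers only. The Welch variant of the t-test, the pooled-variance form of the z-test, the helper name `two_proportion_ztest`, and all sample values are assumptions made for illustration, not the author's exact procedure or results.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for a numeric column split by the target (invented values)
days_retained = rng.normal(loc=400, scale=150, size=500)
days_churned = rng.normal(loc=400, scale=150, size=500)

# Two-sample t-test of means (Welch's variant: no equal-variance assumption)
t_stat, t_p = stats.ttest_ind(days_retained, days_churned, equal_var=False)


def two_proportion_ztest(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for equality of two proportions, pooled variance."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    p_pool = (successes_a + successes_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * stats.norm.sf(abs(z))  # two-sided tail probability
    return z, p_value


# Hypothetical counts: 260 of 500 retained vs. 255 of 500 churned in one category
z_stat, z_p = two_proportion_ztest(260, 500, 255, 500)

print(f"t-test p-value: {t_p:.3f}")
print(f"z-test p-value: {z_p:.3f}")
```

A p-value above the chosen significance level (commonly 0.05) means the test does not detect a difference between the groups, which is the kind of result that motivated dropping the two columns.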

In [18]:
'learner_country_group' in df1.columns

False

In [19]:
to_be_removed = ['subscription_id', 'observation_dt', 'specialization_id', 'specialization_domain',
                 'learner_days_since_registration', 'learner_gender']
df1.drop(to_be_removed, axis=1, inplace=True)
df1.shape

(413953, 45)

## Handle Outliers
Before removing outliers from the data, it is necessary to justify their removal. We need to understand why they exist and whether they are relevant to our analysis. Outliers can be either natural or artificial. Natural outliers are valid data points that reflect the true variability of the data, such as extreme weather events or exceptional performance. Artificial outliers are invalid data points that result from errors, noise, or anomalies, such as measurement errors, data entry mistakes, or fraud. We should keep natural outliers and remove artificial outliers, unless they are important for our research question or hypothesis.<br>
So, although the data contains many outliers, some of them extremely large, I do not remove them. They reflect the natural variability of the data.
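
Keeping the outliers does not rule out counting them. The sketch below flags values with the common 1.5 × IQR rule and reports how many each column has, without dropping any rows. The IQR rule is an assumption chosen for illustration (the notebook does not state how outliers were identified), and the toy values are invented except for the two maxima, which match the `describe()` output.

```python
import pandas as pd

# Toy frame; only the two maxima are taken from the describe() output
toy = pd.DataFrame({
    "learner_other_revenue": [0.0, 0.0, 49.0, 49.0, 98.0, 24069.92],
    "subscription_period_order": [1, 1, 2, 2, 3, 19],
})


def iqr_outlier_mask(series, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Count flagged values per column; the frame itself is left untouched
outlier_counts = {col: int(iqr_outlier_mask(toy[col]).sum()) for col in toy.columns}
print(outlier_counts)
```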

In [20]:
df1.describe()

| | is_retained | cnt_courses_in_specialization | learner_cnt_other_courses_active | learner_cnt_other_courses_paid_active | learner_cnt_other_courses_items_completed | learner_cnt_other_courses_paid_items_completed | learner_cnt_other_transactions_past | learner_other_revenue | subscription_period_order | days_since_last_payment | ... | cnt_enrollments_completed_during_payment_period | cnt_enrollments_active_during_payment_period | cnt_items_completed_during_payment_period | cnt_graded_items_completed_during_payment_period | sum_hours_learning_before_payment_period | sum_hours_learning_during_payment_period | cnt_days_active_before_payment_period | cnt_days_active_during_payment_period | cnt_days_since_last_activity | Popularity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | ... | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 | 413953.0 |
| mean | 0.543663 | 5.926644 | 5.421739 | 1.28152 | 114.289243 | 62.94366 | 1.966615 | 101.281377 | 2.477221 | 12.398519 | ... | 0.235051 | 0.681227 | 21.015077 | 1.743268 | 15.596076 | 3.474735 | 12.47489 | 2.585724 | 29.130725 | 1.317979 |
| std | 0.49809 | 1.759556 | 13.81905 | 3.743266 | 303.46999 | 194.27009 | 6.517248 | 353.680276 | 2.352874 | 8.655959 | ... | 0.655435 | 0.949701 | 50.313255 | 4.422368 | 27.338639 | 8.412766 | 18.937637 | 4.062458 | 52.34098 | 0.551951 |
| min | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.162325 |
| 25% | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 5.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.15 | 0.0 | 2.0 | 0.0 | 1.0 | 0.850538 |
| 50% | 1.0 | 6.0 | 1.0 | 0.0 | 4.0 | 0.0 | 0.0 | 0.0 | 2.0 | 12.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 6.183333 | 0.0 | 6.0 | 1.0 | 9.0 | 0.957605 |
| 75% | 1.0 | 7.0 | 4.0 | 1.0 | 90.0 | 13.0 | 1.0 | 49.0 | 3.0 | 20.0 | ... | 0.0 | 1.0 | 20.0 | 2.0 | 18.633333 | 3.533333 | 15.0 | 4.0 | 32.0 | 1.995327 |
| max | 1.0 | 13.0 | 604.0 | 215.0 | 19439.0 | 6912.0 | 474.0 | 24069.92 | 19.0 | 31.0 | ... | 10.0 | 11.0 | 1043.0 | 149.0 | 2375.683333 | 1393.3 | 616.0 | 31.0 | 548.0 | 1.995327 |


## Pickle the Data

In [21]:
import pickle
with open('Data//prepared data.pickle', 'wb') as file:
    pickle.dump(df1, file) 
print('Done!')

Done!
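
Printing `Done!` only shows that the cell finished. A round-trip check, sketched below on a toy frame written to a temporary directory, would additionally confirm that the pickled object reloads unchanged. The frame contents and the temporary path are assumptions for illustration; the real notebook writes `df1` to its own `Data` folder.

```python
import os
import pickle
import tempfile

import pandas as pd

# Toy frame standing in for the prepared `df1` (values are invented)
toy = pd.DataFrame({"is_retained": [1, 0, 1], "Popularity": [0.85, 0.96, 1.99]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "prepared data.pickle")

    # Write the frame to disk
    with open(path, "wb") as file:
        pickle.dump(toy, file)

    # Read it back before the temporary directory is cleaned up
    with open(path, "rb") as file:
        restored = pickle.load(file)

# True when shape, dtypes, and values all match
round_trip_ok = toy.equals(restored)
print(round_trip_ok)
```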


## Conclusion
In this phase, I prepared the data for modeling. I encoded the categorical columns, removed irrelevant columns, and reviewed the outliers, which I kept because they reflect the natural variability of the data. I did not transform or engineer features; I will proceed with the data as it is and, if the models do not perform well, come back to manipulate the features.