<a href="https://colab.research.google.com/github/Freemanlabs/giz-rwanda-ai-training/blob/master/02_eda/intro_to_eda.ipynb" target="_blank">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

# Tutorial: Introduction to Exploratory Data Analysis

This tutorial guides you through the basics of conducting exploratory data analysis (EDA) using Python, from loading data to generating insights through data visualizations.

The notebook used in this tutorial examines customer data, including which customers left within the last month, and demonstrates how to load, clean, and explore that data.

## What is EDA?

Exploratory data analysis (EDA) is a critical initial step in the data science process that involves analyzing and visualizing data to:

- Uncover its main characteristics.
- Identify patterns and trends.
- Detect anomalies.
- Understand relationships between variables.

EDA provides insights into the dataset, facilitating informed decisions about further statistical analyses or modeling.

This step is especially important when you go on to model the data with machine learning.

## Importing the required libraries for EDA

You start by importing all necessary libraries for data science and analysis.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 1. Understand the Data

Understanding the basics of the dataset is crucial for any data science project. It involves familiarizing oneself with the structure, types, and quality of the data at hand.

### Load the Data

Then, you read in the data as a pandas DataFrame.

In [None]:
# df = pd.read_csv("data/CustomerChurn.csv")
df = pd.read_csv("https://raw.githubusercontent.com/Freemanlabs/giz-rwanda-ai-training/master/02_eda/data/CustomerChurn.csv")

### Getting Insights About the Dataset

- `.shape`: Dimensions
- `.head()` and `.tail()`: First and last rows
- `.info()`: Data types and nulls
- `.dtypes`: Data types
- `.describe()`: Summary statistics

The `df.shape` command returns the dimensions of the DataFrame, giving you a quick overview of the number of rows and columns.

Let's check the shape of the data using the `.shape` attribute.

In [None]:
df.shape

The `df.head()` method shows the first few rows of the dataset.

In [None]:
df.head()

The `df.tail()` method shows the last few rows of the dataset.

In [None]:
df.tail()

The `df.info()` method lists the columns together with their data types and non-null counts.

In [None]:
# Concise summary of the DataFrame; verbose=True lists every column even when there are many
df.info(verbose=True)

The `df.dtypes` command provides the data types of each column, helping you understand the kind of data you are dealing with.

In [None]:
df.dtypes

The `df.describe()` command generates descriptive statistics for numerical columns, such as mean, standard deviation, and percentiles, which can help you identify patterns, detect anomalies, and understand the distribution of your data.

In [None]:
df.describe()

`SeniorCitizen` is actually a categorical variable stored as a number, so its 25%-50%-75% distribution is not meaningful.

75% of customers have a tenure of less than 55 months.

The average monthly charge is USD 64.76, whereas 25% of customers pay more than USD 89.85 per month.
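Because a flag such as `SeniorCitizen` only takes a couple of values, counts tend to say more than quartiles. The following is a minimal sketch on a small made-up DataFrame (the rows are illustrative assumptions, not values from the dataset) that casts the flag to a categorical type and summarizes it separately from the numeric columns.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the tutorial's dataset
toy = pd.DataFrame({
    "SeniorCitizen": [0, 0, 1, 0, 1, 0],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65],
})

# Treat the 0/1 flag as a category so it is no longer summarized with quartiles
toy["SeniorCitizen"] = toy["SeniorCitizen"].astype("category")

# By default describe() covers numeric columns only
numeric_summary = toy.describe()

# For categorical columns describe() reports count, unique, top and freq
categorical_summary = toy.describe(include="category")

# Share of each category
senior_share = toy["SeniorCitizen"].value_counts(normalize=True)

print(numeric_summary)
print(categorical_summary)
print(senior_share)
```

On the real data the same cast could be applied to `df["SeniorCitizen"]` before calling `df.describe()`.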

In [None]:
df['Churn'].value_counts().plot(kind='barh', figsize=(8, 6))
plt.xlabel("Count", labelpad=14)
plt.ylabel("Target Variable", labelpad=14)
plt.title("Count of TARGET Variable per category", y=1.02)

In [None]:
100*df['Churn'].value_counts()/len(df['Churn'])

In [None]:
df['Churn'].value_counts()

- The data is imbalanced: the two classes split roughly 73:27, with churned customers in the minority.
- So we analyze the other features separately for each target value to get some insights.
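One way to look at features separately for each target value is to group by the target and compare a summary statistic per class. The sketch below uses made-up rows (illustrative assumptions, not the real dataset) and the median, which is only one of several reasonable choices.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the tutorial's dataset
toy = pd.DataFrame({
    "Churn": ["No", "No", "No", "Yes", "Yes", "No"],
    "tenure": [34, 45, 60, 2, 8, 22],
    "MonthlyCharges": [56.95, 42.30, 89.10, 70.70, 99.65, 29.85],
})

# Class balance expressed as percentages
class_share = toy["Churn"].value_counts(normalize=True) * 100

# Median of each numeric feature within each target class
per_class_median = toy.groupby("Churn")[["tenure", "MonthlyCharges"]].median()

print(class_share.round(2))
print(per_class_median)
```

Comparing the two rows of `per_class_median` side by side gives a first hint of which numeric features differ between the classes.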

## 2. Data Cleaning

Cleaning data is a vital step in EDA to ensure the dataset is accurate, consistent, and ready for meaningful analysis. This process involves several key tasks, including:

- Identifying and removing any duplicate data.
- Handling missing values, which might involve replacing them with a specific value or removing the affected rows.
- Standardizing data types (for example, converting strings to datetime) through conversions and transformations to ensure consistency. You might also want to convert data to a format that's easier for you to work with.

This cleaning phase is essential as it improves the quality and reliability of the data, enabling more accurate and insightful analysis.

In [None]:
telco_data = df.copy()

### Correcting Data Types

Ensure data types are appropriate for analysis. For example, converting:

- categorical variables to the correct type
- dates to `datetime`
- numbers stored as strings to `float/int`
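The three conversions listed above could look like the following sketch. The DataFrame is made up for illustration, and `SignupDate` in particular is a hypothetical column used only to show the `datetime` case.

```python
import pandas as pd

# Hypothetical toy data where every column starts out as text
raw = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "SignupDate": ["2023-01-15", "2022-11-03", "2023-06-30"],  # hypothetical column
    "TotalCharges": ["29.85", "1889.5", "108.15"],
})

typed = raw.assign(
    # Categorical variable to the category dtype
    Contract=raw["Contract"].astype("category"),
    # Date strings to datetime64
    SignupDate=pd.to_datetime(raw["SignupDate"], format="%Y-%m-%d"),
    # Numbers stored as strings to float
    TotalCharges=pd.to_numeric(raw["TotalCharges"]),
)

print(typed.dtypes)
```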

`TotalCharges` should be a numeric amount. Let's convert it to a numeric data type. With `errors='coerce'`, any value that cannot be parsed as a number becomes `NaN`.

In [None]:
telco_data.TotalCharges = pd.to_numeric(telco_data.TotalCharges, errors='coerce')
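Before relying on a coerced column, it can help to see which raw values failed to parse. The sketch below does this on a made-up Series; the sample strings are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical raw values: two valid numbers, one blank string, one text marker
raw = pd.Series(["29.85", " ", "108.15", "n/a"], name="TotalCharges")

# Unparseable entries are turned into NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

# Values that were present in the raw data but became NaN after conversion
failed = raw[converted.isna() & raw.notna()]

print(converted)
print("Could not be parsed:", failed.tolist())
```

Listing the failed values makes it easier to decide whether they represent genuinely missing data or a formatting problem that could be repaired.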

### Handling Missing Values

Missing values can affect analysis results. Common techniques include filling them in (`.fillna()`) or dropping the affected rows or columns (`.dropna()`).

**Filling the Missing Values – Imputation**

In this case, we will be filling the missing values with a certain number.

The possible ways to do this are:

- Filling the missing data with the mean or median value if it’s a numerical variable.
- Filling the missing data with mode if it’s a categorical value.
- Filling the numerical value with 0 or -999, or some other number that will not occur in the data. This can be done so that the machine can recognize that the data is not real or is different.
- Filling the categorical value with a new type for the missing values.
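The options in the list above can be sketched with `.fillna()` as follows. The toy DataFrame and the `"Unknown"` label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, np.nan, 80.0, 50.0],
    "Contract": ["One year", None, "One year", "Two year"],
})

# Numerical variable: fill with the median (the mean works the same way)
median_filled = toy["MonthlyCharges"].fillna(toy["MonthlyCharges"].median())

# Categorical variable: fill with the mode (most frequent value)
mode_filled = toy["Contract"].fillna(toy["Contract"].mode()[0])

# Numerical variable: fill with a sentinel that does not occur in the data
sentinel_filled = toy["MonthlyCharges"].fillna(-999)

# Categorical variable: fill with a new, explicit category
new_category = toy["Contract"].fillna("Unknown")

print(median_filled.tolist())
print(mode_filled.tolist())
```

Note that a sentinel such as -999 distorts statistics like the mean, so it is usually worth excluding it again before computing summaries.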



In [None]:
# Count the missing values in each column
telco_data.isnull().sum()

Because the rows with missing values make up only about 0.15% of the dataset, it is safe to drop them from further processing.
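A share like the one quoted above can be computed directly from the DataFrame. This is a minimal sketch on made-up rows, so the printed percentage is for the toy data only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one of eight rows has a missing TotalCharges
toy = pd.DataFrame({
    "tenure": [1, 0, 34, 45, 2, 8, 22, 10],
    "TotalCharges": [29.85, np.nan, 1889.5, 1840.75, 108.15, 820.5, 1949.4, 301.9],
})

# True for every row that has at least one missing value
rows_with_missing = toy.isnull().any(axis=1)

# The mean of a boolean Series is the fraction of True values
pct_missing = 100 * rows_with_missing.mean()

print(f"{pct_missing:.2f}% of rows have at least one missing value")
```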

In [None]:
# Remove rows with missing values
telco_data.dropna(inplace=True)

# Alternative: fill them instead, e.g. telco_data = telco_data.fillna(0)

### Handling Duplicates

Check if the data has any duplicate rows or columns. If so, remove them.

In [None]:
# Check for duplicate rows
duplicate_rows = telco_data.duplicated().sum()

# Check for duplicate columns
duplicate_columns = telco_data.columns[telco_data.columns.duplicated()].tolist()

# Print the duplicates
print("Duplicate rows count:", duplicate_rows)
print("Duplicate columns:", duplicate_columns)

# Drop duplicate rows
telco_data = telco_data.drop_duplicates()

# Drop duplicate columns
telco_data = telco_data.loc[:, ~telco_data.columns.duplicated()]
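Before dropping duplicates it is often useful to look at them. The sketch below, on a made-up DataFrame, uses `keep=False` so that every copy of a repeated row is shown rather than only the later repeats.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 2 are identical
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "tenure": [1, 34, 1, 2],
    "MonthlyCharges": [29.85, 56.95, 29.85, 53.85],
})

# keep=False flags every member of a duplicate group
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row by default
deduped = toy.drop_duplicates().reset_index(drop=True)

print(all_copies)
print("Rows before:", len(toy), "after:", len(deduped))
```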

### Removing Irrelevant Columns

Remove columns that are not required for the analysis.

In [None]:
# Drop the customerID column
telco_data.drop(columns=['customerID'], inplace=True)
telco_data.head()
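Identifier columns such as `customerID` can often be spotted because every value is distinct. The following is a rough heuristic sketched on made-up data, not a rule: a non-numeric column with as many unique values as rows is flagged as identifier-like.

```python
import pandas as pd

# Hypothetical toy data with made-up customer IDs
toy = pd.DataFrame({
    "customerID": ["A-001", "B-002", "C-003", "D-004"],
    "gender": ["Female", "Male", "Female", "Male"],
    "tenure": [1, 34, 2, 45],
})

# Flag non-numeric columns in which every value is unique
id_like = [
    col for col in toy.columns
    if not pd.api.types.is_numeric_dtype(toy[col])
    and toy[col].nunique() == len(toy)
]

print("Identifier-like columns:", id_like)
```

Numeric columns are skipped on purpose, since a continuous measurement can also be unique in every row without being an identifier.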

## 3. Data Exploration

Explore the dataset to gain insights into its structure and distribution.

### Univariate Analysis

Analyze individual variables to understand their distribution and characteristics.

Plot the distribution of individual predictors by churn.

In [None]:
for i, predictor in enumerate(telco_data.drop(columns=['Churn', 'TotalCharges', 'MonthlyCharges'])):
    plt.figure(i)
    sns.countplot(data=telco_data, x=predictor, hue='Churn')

Convert the target variable `Churn` into a binary numeric variable, i.e. Yes = 1 and No = 0.

In [None]:
telco_data['Churn'] = np.where(telco_data.Churn == 'Yes',1,0)
telco_data.head()
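An alternative to `np.where` is an explicit mapping with `Series.map`. The sketch below uses a made-up Series; it is shown as an option, not as the notebook's method. One difference is that labels missing from the mapping become `NaN` instead of silently turning into 0.

```python
import pandas as pd

# Hypothetical toy labels
churn = pd.Series(["Yes", "No", "No", "Yes", "No"], name="Churn")

# Explicit mapping; any label not listed here would become NaN
encoded = churn.map({"Yes": 1, "No": 0})

# Number of labels that the mapping did not cover
unmapped = int(encoded.isna().sum())

# With a 0/1 target, the mean is the churn rate
churn_rate = encoded.mean()

print(encoded.tolist(), "unmapped:", unmapped, "churn rate:", churn_rate)
```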

### Bivariate Analysis

Explore the relationship between two variables to understand their interactions.

In [None]:
# Split the data by target value: 0 = not churned, 1 = churned
new_df1_target0 = telco_data.loc[telco_data["Churn"] == 0]
new_df1_target1 = telco_data.loc[telco_data["Churn"] == 1]

In [None]:
def uniplot(df, col, title, hue=None):
    """Draw a count plot of `col`, optionally split by the `hue` column."""
    sns.set_style('whitegrid')
    sns.set_context('talk')
    plt.rcParams['axes.labelsize'] = 20
    plt.rcParams['axes.titlesize'] = 22
    plt.rcParams['axes.titlepad'] = 30

    # Widen the figure with the number of categories in col and in the hue column
    n_hue = df[hue].nunique() if hue is not None else 0
    width = df[col].nunique() + 7 + 4 * n_hue
    fig, ax = plt.subplots(figsize=(width, 8))
    ax.set_title(title)

    # Order the bars from the most to the least frequent category
    sns.countplot(data=df, x=col, order=df[col].value_counts().index, hue=hue, ax=ax)

    plt.show()
    plt.close(fig)

In [None]:
uniplot(new_df1_target1,col='Partner',title='Distribution of Partner for Churned Customers',hue='gender')

In [None]:
uniplot(new_df1_target0,col='Partner',title='Distribution of Partner for Non Churned Customers',hue='gender')

In [None]:
uniplot(new_df1_target1,col='PaymentMethod',title='Distribution of PaymentMethod for Churned Customers',hue='gender')

In [None]:
uniplot(new_df1_target1,col='Contract',title='Distribution of Contract for Churned Customers',hue='gender')

In [None]:
uniplot(new_df1_target1,col='TechSupport',title='Distribution of TechSupport for Churned Customers',hue='gender')

In [None]:
uniplot(new_df1_target1,col='SeniorCitizen',title='Distribution of SeniorCitizen for Churned Customers',hue='gender')
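The count plots above show raw counts, which depend on how large each group is. A churn rate per category complements them. This sketch computes one with `pd.crosstab` on made-up rows (illustrative assumptions, not the real dataset).

```python
import pandas as pd

# Hypothetical toy rows with the target already encoded as 0/1
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year",
                 "Two year", "Month-to-month", "One year"],
    "Churn": [1, 0, 0, 0, 1, 1],
})

# normalize="index" turns each row into shares that sum to 1,
# so column 1 holds the churn rate within each contract type
rate_by_contract = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")

print(rate_by_contract)
```

The same call could be applied to `telco_data` with any categorical column in place of `Contract`.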

## Conclusion

These are some of the quick insights from this exercise:

1. Customers who pay by electronic check are the highest churners.
2. Contract type: month-to-month customers are more likely to churn because they have no contract term and are free to leave.
3. Customers with no online security or no tech support are high churners.
4. Non-senior citizens account for more churned customers than senior citizens.

Note: There could be many more such insights, so take this as an assignment and try to get more insights :)

Exploratory Data Analysis provides valuable insights through data exploration, cleaning, and visualization. By understanding the fundamental steps of EDA and applying them to market analysis, professionals can make data-driven decisions and uncover hidden trends. Mastering EDA techniques is essential for anyone looking to excel in data science.