## Data Cleaning & EDA

Before diving into specific questions, it's essential to understand the nature and structure of your datasets. Use Python (with libraries such as pandas, Matplotlib, and Seaborn) for this initial exploration, and fix any data quality problems you find along the way.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1. Client Profiles file cleaning

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/data-bootcamp-v4/lessons/main/5_6_eda_inf_stats_tableau/project/files_for_project/df_final_demo.txt')
df.head()

Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,gendr,num_accts,bal,calls_6_mnth,logons_6_mnth
0,836976,6.0,73.0,60.5,U,2.0,45105.3,6.0,9.0
1,2304905,7.0,94.0,58.0,U,2.0,110860.3,6.0,9.0
2,1439522,5.0,64.0,32.0,U,2.0,52467.79,6.0,9.0
3,1562045,16.0,198.0,49.0,M,2.0,67454.65,3.0,6.0
4,5126305,12.0,145.0,33.0,F,2.0,103671.75,0.0,3.0


In [5]:
# Checking how many columns and rows are in the dataset

df.shape

(70609, 9)

In [7]:
# Summary stats

df.describe()

Unnamed: 0,client_id,clnt_tenure_yr,clnt_tenure_mnth,clnt_age,num_accts,bal,calls_6_mnth,logons_6_mnth
count,70609.0,70595.0,70595.0,70594.0,70595.0,70595.0,70595.0,70595.0
mean,5004992.0,12.05295,150.659367,46.44224,2.255528,147445.2,3.382478,5.56674
std,2877278.0,6.871819,82.089854,15.591273,0.534997,301508.7,2.23658,2.353286
min,169.0,2.0,33.0,13.5,1.0,13789.42,0.0,1.0
25%,2519329.0,6.0,82.0,32.5,2.0,37346.83,1.0,4.0
50%,5016978.0,11.0,136.0,47.0,2.0,63332.9,3.0,5.0
75%,7483085.0,16.0,192.0,59.0,2.0,137544.9,6.0,7.0
max,9999839.0,62.0,749.0,96.0,8.0,16320040.0,7.0,9.0


In [9]:
# Checking how many null values are in each column of the dataset

df.isnull().sum()

client_id            0
clnt_tenure_yr      14
clnt_tenure_mnth    14
clnt_age            15
gendr               14
num_accts           14
bal                 14
calls_6_mnth        14
logons_6_mnth       14
dtype: int64

In [11]:
# Dropping rows in which every value is missing. Since client_id is never null,
# this acts as a safeguard and removes no rows here.

df.dropna(how='all', inplace=True)

In [15]:
# Dropping rows that have at least one missing value. This is a drastic choice, but
# only 14 or 15 values are missing per column (out of 70,609 rows), so very little data is lost.

df = df.dropna()
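
Before dropping rows, it can help to quantify how much data would be lost. The sketch below illustrates that check on a small made-up frame (`toy` and its values are hypothetical, not the project data); the same expressions could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the client profiles table
toy = pd.DataFrame({
    "client_id": [1, 2, 3, 4, 5],
    "clnt_age": [60.5, np.nan, 32.0, 49.0, np.nan],
    "gendr": ["U", None, "M", "F", "F"],
})

# Number of rows with at least one missing value
rows_with_na = toy.isna().any(axis=1).sum()

# Share of rows that dropna() would remove
share_lost = rows_with_na / len(toy)

cleaned = toy.dropna()
print(f"{rows_with_na} of {len(toy)} rows would be dropped ({share_lost:.1%})")
```

If the share turned out to be large, imputing or dropping only on selected columns might be a better option than a blanket `dropna()`.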

In [17]:
# Checking the number of unique values per column

df.nunique()

client_id           70594
clnt_tenure_yr         54
clnt_tenure_mnth      482
clnt_age              165
gendr                   4
num_accts               8
bal                 70332
calls_6_mnth            8
logons_6_mnth           9
dtype: int64

In [19]:
# Checking the categories in 'gendr' before recoding the rare 'X' values as 'U' (Unknown)

df['gendr'].value_counts()

gendr
U    24122
M    23724
F    22745
X        3
Name: count, dtype: int64

In [21]:
# Recoding 'X' as 'U' (Unknown) so that 'gendr' has three categories
df['gendr'] = df['gendr'].replace('X', 'U')
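
After recoding, a quick check can confirm that only the expected categories remain. This is a minimal sketch on a made-up Series (`gendr` and the `allowed` set are illustrative assumptions, not part of the original notebook).

```python
import pandas as pd

# Hypothetical sample of gender codes, including the rare 'X' value
gendr = pd.Series(["U", "M", "F", "X", "M", "X"])

# Recode 'X' as 'U' (Unknown), mirroring the step above
recoded = gendr.replace("X", "U")

# Any category outside the expected set would show up here
allowed = {"U", "M", "F"}
unexpected = set(recoded.unique()) - allowed
print(recoded.value_counts())
print("Unexpected categories:", unexpected)
```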

In [None]:
# Exporting the clean dataset to a CSV file for later analysis

df.to_csv('client_profiles_clean.csv', index=False)

### 2. Merging client footprints datasets

In [23]:
df_pt_1 = pd.read_csv(r"https://raw.githubusercontent.com/data-bootcamp-v4/lessons/refs/heads/main/5_6_eda_inf_stats_tableau/project/files_for_project/df_final_web_data_pt_1.txt")
df_pt_2 = pd.read_csv(r"https://raw.githubusercontent.com/data-bootcamp-v4/lessons/refs/heads/main/5_6_eda_inf_stats_tableau/project/files_for_project/df_final_web_data_pt_2.txt")

In [25]:
# Stacking the two parts vertically; ignore_index=True rebuilds a continuous row index
merged_df = pd.concat([df_pt_1, df_pt_2], ignore_index=True)
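
When stacking two files, two sanity checks are often useful: that both parts share the same columns, and whether the combined table contains exact duplicate rows. The sketch below shows both on made-up data (`part_1` and `part_2` are hypothetical stand-ins; whether the real files contain duplicates is not established here).

```python
import pandas as pd

# Hypothetical stand-ins for the two web data parts
part_1 = pd.DataFrame({
    "client_id": [1, 1],
    "process_step": ["start", "step_1"],
    "date_time": ["2017-04-17 15:18:04", "2017-04-17 15:19:13"],
})
part_2 = pd.DataFrame({
    "client_id": [1, 2],
    "process_step": ["step_1", "start"],
    "date_time": ["2017-04-17 15:19:13", "2017-05-24 18:44:34"],
})

# Both parts should expose the same columns in the same order
same_columns = list(part_1.columns) == list(part_2.columns)

# Stack the parts and rebuild the index
combined = pd.concat([part_1, part_2], ignore_index=True)

# Count rows that are exact copies of an earlier row
n_duplicates = int(combined.duplicated().sum())
print(same_columns, len(combined), n_duplicates)
```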

In [35]:
merged_df

Unnamed: 0,client_id,visitor_id,visit_id,process_step,date_time
0,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07
1,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51
2,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22
3,9988021,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13
4,9988021,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04
...,...,...,...,...,...
755400,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:46:10
755401,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:45:29
755402,9668240,388766751_9038881013,922267647_3096648104_968866,step_1,2017-05-24 18:44:51
755403,9668240,388766751_9038881013,922267647_3096648104_968866,start,2017-05-24 18:44:34


In [None]:
# Exporting it just in case we want to use it separately

merged_df.to_csv('Digital_Footprint.csv', index=False)

### 3. Merging client footprints with the experiment dataframe

Our data is currently split across two sources: the combined digital footprints (`merged_df`) and the experiment roster, which records each client’s experimental group in a column called `Variation`. The goal is to create a new, combined dataset that includes both the clients’ digital footprints and their group assignment.

With this combined data in one file (`experiment_footprints_clients.csv`) and the clients’ profile information in another (`client_profiles_clean.csv`), we will be ready to move forward with our analysis.
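
The join described above can be sketched on toy data. The frames `experiment` and `footprints` are hypothetical, and the `validate="one_to_many"` argument encodes an assumption (each client appears only once in the experiment roster) that should be confirmed against the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical experiment roster: one row per client
experiment = pd.DataFrame({
    "client_id": [1, 2, 3],
    "Variation": ["Test", "Control", np.nan],
})

# Hypothetical footprints: many rows per client, plus a client (4) not in the roster
footprints = pd.DataFrame({
    "client_id": [1, 1, 2, 3, 4],
    "process_step": ["start", "step_1", "start", "start", "start"],
})

# Inner join keeps only client_ids present in both tables;
# validate raises MergeError if client_id repeats in the left table
joined = experiment.merge(
    footprints, on="client_id", how="inner", validate="one_to_many"
)
print(joined)
```

Client 4 disappears because it has no roster entry, while client 3 is kept with a missing `Variation`.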

In [27]:
df1 = pd.read_csv('https://raw.githubusercontent.com/data-bootcamp-v4/lessons/refs/heads/main/5_6_eda_inf_stats_tableau/project/files_for_project/df_final_experiment_clients.txt')

In [37]:
df1

Unnamed: 0,client_id,Variation
0,9988021,Test
1,8320017,Test
2,4033851,Control
3,1982004,Test
4,9294070,Control
...,...,...
70604,2443347,
70605,8788427,
70606,266828,
70607,1266421,


In [29]:
# Inner merge (the pandas default): keeps only clients present in both tables
experiment_df = df1.merge(merged_df, on='client_id')

In [33]:
experiment_df

Unnamed: 0,client_id,Variation,visitor_id,visit_id,process_step,date_time
0,9988021,Test,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:27:07
1,9988021,Test,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:26:51
2,9988021,Test,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:19:22
3,9988021,Test,580560515_7732621733,781255054_21935453173_531117,step_2,2017-04-17 15:19:13
4,9988021,Test,580560515_7732621733,781255054_21935453173_531117,step_3,2017-04-17 15:18:04
...,...,...,...,...,...,...
449826,9895983,,473024645_56027518531,498981662_93503779869_272484,step_3,2017-06-15 19:52:09
449827,9895983,,473024645_56027518531,498981662_93503779869_272484,step_2,2017-06-15 19:50:37
449828,9895983,,473024645_56027518531,498981662_93503779869_272484,step_1,2017-06-15 19:50:05
449829,9895983,,473024645_56027518531,498981662_93503779869_272484,start,2017-06-15 19:50:00


In [31]:
experiment_df.isnull().sum()

client_id            0
Variation       128522
visitor_id           0
visit_id             0
process_step         0
date_time            0
dtype: int64

In [None]:
# Dropping rows with no recorded Variation (the only column with missing values)
experiment_df = experiment_df.dropna()
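
Since `Variation` is the only column with missing values, the drop can also be written so that it targets that column explicitly. The sketch below is one possible variant on made-up data (`toy` is hypothetical), followed by a count of distinct clients per group.

```python
import numpy as np
import pandas as pd

# Hypothetical merged footprints with one client lacking a group assignment
toy = pd.DataFrame({
    "client_id": [1, 1, 3],
    "Variation": ["Test", "Test", np.nan],
    "process_step": ["start", "step_1", "start"],
})

# Drop only the rows where Variation is missing
assigned = toy.dropna(subset=["Variation"])

# Number of distinct clients in each experimental group
group_sizes = assigned.groupby("Variation")["client_id"].nunique()
print(group_sizes)
```

Naming the column makes the intent explicit and avoids silently dropping rows if another column gains missing values later.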

In [None]:
# Exporting the footprints with group assignment for later analysis
experiment_df.to_csv('experiment_footprints_clients.csv', index=False)