# Practice Session 01+02: Data preparation

Author: <font color="blue">Luca Franceschi</font>

E-mail: <font color="blue">luca.franceschi01@estudiant.upf.edu</font>

Date: <font color="blue">7/10/2024</font>

# 1. Exploratory data analysis 

In [58]:
import pandas as pd
import seaborn as sns
import datetime

import numpy as np
from numpy import array
from numpy import argmax

import matplotlib.pyplot as plt
from matplotlib import pyplot

import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split

In [59]:
# LEAVE AS-IS

input_dataset = pd.read_csv("device_db.csv", sep=",")

## 1.1. Data types and simple statistics

<font size="+1" color="red">Replace this cell with your code to print the dataset header (column names) and the first five rows of data.</font>

In [None]:
display(input_dataset.head(5))

<font size="+1" color="red">Replace this cell with your code to create and display a dataframe containing one row per column, and with the following fields: name of the column, type, number of distinct elements, and size. The size of all columns should be equal.</font>

In [None]:
column_type = []
for column in input_dataset.columns:
    column_type.append({'name': column, 'type': input_dataset[column].dtype,
                        'distinct': input_dataset[column].nunique(), 'size': input_dataset[column].size})
column_type_df = pd.DataFrame(column_type, columns=['name', 'type', 'distinct', 'size'])
display(column_type_df)

<font size="+1" color="red">Replace this cell with code to create and display a dataframe containing one row per each column of type ``float64`` in the input data, and with the following fields: name of the column, mean, median, min, max -- all computed ignoring NaN values.</font>

In [None]:
basic_descriptors = []

for col in column_type_df[column_type_df.type=='float64'].name:
    series = input_dataset[col]
    basic_descriptors.append({'name': col, 'mean': np.nanmean(series), 'median': np.nanmedian(series), 'min': np.nanmin(series), 'max': np.nanmax(series)})

basic_descriptors_df = pd.DataFrame(basic_descriptors, columns=['name', 'mean', 'median', 'min', 'max'])
pd.options.display.float_format = '{:.2f}'.format
display(basic_descriptors_df)

<font size="+1" color="red">Replace this cell with code to print each column name and then use the `describe` function to print statistics for that column. Include a blank line after each description.</font>

In [None]:
for col in column_type_df.name:
    print(col)  # the exercise asks for the column name before its statistics
    print(input_dataset[col].describe())
    print()

In [None]:
input_dataset['DURATION_LINE'].size - input_dataset['DURATION_LINE'].isna().sum()

<font size="+1" color="red">Replace this cell with a brief commentary comparing the previous results for **DURATION_LINE** (time that the customer has had a line) with the ones from the `describe` function.</font>

<font size="+1" color="red">Indicate all the differences between the statistics that `describe` computed, and the statistics you computed (e.g., missing or extra computations).</font>

The `describe` function also reports the quartiles (25%, 50%, 75%), not only the median (the 2nd quartile), and it includes the standard deviation. Its `count` is the number of non-NaN values, which in our manual computation corresponds to `size - count(NaN)` (computed in the cell above). On the other hand, for numeric columns `describe` does not report the number of distinct values.
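
The difference in `count` can be checked on a toy series. The sketch below uses made-up values (not the course dataset) and only illustrates that `describe()` counts non-NaN entries while `size` counts every row; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with two missing values (not taken from device_db.csv)
toy = pd.Series([12.0, 24.0, np.nan, 24.0, np.nan, 60.0])

summary = toy.describe()

# describe() counts only non-NaN entries, so it equals size minus the number of NaNs
manual_count = toy.size - toy.isna().sum()
print(summary['count'], manual_count)

# describe() on a numeric column has no "distinct" field; nunique() supplies it
distinct = toy.nunique()
print(distinct)
```

Here `size` is 6, while both `describe()['count']` and the manual computation ignore the two NaN entries.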

## 1.2. Inventory of device models

<font size="+1" color="red">Replace this cell with code to display a census of PREVIOUS_DEVICE_MODEL and PREVIOUS_DEVICE_BRAND. You should create and display a dataframe in each case.</font>

In [None]:
pd.options.display.float_format = '{:.2%}'.format

pdm_frequency = pd.DataFrame(input_dataset.PREVIOUS_DEVICE_MODEL.value_counts(normalize=True))  \
    .reset_index(drop=False)                                                        \
    .rename(columns={'PREVIOUS_DEVICE_MODEL': 'Previous_Device_Model', 'proportion': 'Frequency'})

display(pdm_frequency.head(10))

In [None]:
pd.options.display.float_format = '{:.2%}'.format

pdb_frequency = pd.DataFrame(input_dataset.PREVIOUS_DEVICE_BRAND.value_counts(normalize=True))  \
    .reset_index(drop=False)                                                    \
    .rename(columns={'PREVIOUS_DEVICE_BRAND': 'Previous_Device_Brand', 'proportion': 'Frequency'})

display(pdb_frequency)

<font size="+1" color="red">The most common device model and the most common device brand do not match, why do you think it is so? Replace this cell with an explanation.</font>

Samsung's market share may be greater than Apple's because Samsung sells more device models, even though no single Samsung model was particularly well received by the market; Apple's sales would then be concentrated in fewer models.

However, it could also be that a large amount of missing data in these columns distorts the frequencies (e.g., PREVIOUS_DEVICE_BRAND might always be filled in for Samsung devices, whereas it might be mostly NaN for Apple products for some reason).
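
Both hypotheses can be checked numerically. The following is a minimal sketch on a made-up census (brands `A`/`B` and their models are hypothetical, not values from the dataset): it counts rows and distinct models per brand, and measures how much of the brand column is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy census (brands and models are made up for illustration)
toy = pd.DataFrame({
    'PREVIOUS_DEVICE_BRAND': ['A', 'A', 'A', 'B', 'B', 'B', 'B', np.nan],
    'PREVIOUS_DEVICE_MODEL': ['A1', 'A1', 'A1', 'B1', 'B2', 'B3', 'B4', 'A1'],
})

# Rows and number of distinct models per brand (rows with a NaN brand are dropped by groupby)
per_brand = toy.groupby('PREVIOUS_DEVICE_BRAND')['PREVIOUS_DEVICE_MODEL'].agg(
    rows='size', distinct_models='nunique')

top_brand = toy['PREVIOUS_DEVICE_BRAND'].value_counts().idxmax()
top_model = toy['PREVIOUS_DEVICE_MODEL'].value_counts().idxmax()

# Share of rows where the brand is missing
missing_brand_share = toy['PREVIOUS_DEVICE_BRAND'].isna().mean()

print(per_brand)
print(top_brand, top_model, missing_brand_share)
```

In this toy example brand `B` is the most common brand because it is spread over four models, while the most common single model belongs to brand `A`; one `A1` row also has a missing brand.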

# 2. Feature engineering

## 2.1. Missing values management

<font size="+1" color="red">Replace this cell with your code to print all columns that contain at least one NaN value, and what is the percentage of NaN values in that column. (Create a dataframe with this information, and then display it.)</font>

In [None]:
pd.options.display.float_format = '{:.2%}'.format

def display_NaN_Frequency(dataset):
    nan_info = []

    for col in dataset.columns:
        nan_info.append({'Column_Name': col, 'NaN_Frequency': dataset[col].isna().sum()/dataset[col].size})

    nan_info_pd = pd.DataFrame(nan_info, columns=['Column_Name', 'NaN_Frequency'])
    nan_info_pd = nan_info_pd[nan_info_pd.NaN_Frequency > 0].sort_values(by='NaN_Frequency', ascending=False)

    display(nan_info_pd) # in this case all columns contain at least one NaN value

display_NaN_Frequency(input_dataset)

<font size="+1" color="red">If there is no **PURCHASED\_DEVICE**, **DEVICE\_VALUE**, or **PREVIOUS\_DEVICE\_MODEL**, the row is useless to us. Replace this cell with code to remove those rows.</font>

In [None]:
f'Initially we have {input_dataset.shape[0]} rows'

In [None]:
df02 = input_dataset.dropna(axis=0, subset=['PURCHASED_DEVICE', 'DEVICE_VALUE', 'PREVIOUS_DEVICE_MODEL'], how='any').copy()
f'Now we have {df02.shape[0]} rows'

<font size="+1" color="red">Any NaN value in **DATA\_TRAFFIC\_MONTH\_(1..6)**, **VOICE\_TRAFFIC\_MONTH_(1..6)**, **BILLING\_MONTH_(1..6)**, or **DEVICE\_COST\_MONTH\_(1..6)** should be assumed to be 0. Replace this cell with code to do that imputation.</font>

In [None]:
pd.options.display.float_format = '{:.2%}'.format

df03 = df02.copy()
for i in range(1,7):
    df03.fillna({f'DATA_TRAFFIC_MONTH_{i}': 0}, inplace=True)
    df03.fillna({f'VOICE_TRAFFIC_MONTH_{i}': 0}, inplace=True)
    df03.fillna({f'BILLING_MONTH_{i}': 0}, inplace=True)
    df03.fillna({f'DEVICE_COST_MONTH_{i}': 0}, inplace=True)

# check NaN_Frequency
display_NaN_Frequency(df03)

<font size="+1" color="red">If there is no **LINE\_ACTIVATION\_DATE**, we will assume it is equal to **LAST\_DEVICE\_CHANGE**. Replace this cell with code to do that imputation.</font>

In [None]:
df04 = df03.fillna({'LINE_ACTIVATION_DATE': df03['LAST_DEVICE_CHANGE']}).copy()

# check NaN_Frequency
pd.options.display.float_format = '{:.2%}'.format
display_NaN_Frequency(df04)

# looks good
pd.options.display.float_format = '{:.2f}'.format
# display(df03[df03.LINE_ACTIVATION_DATE.isna()].head(3))
# display(df04[df03.LINE_ACTIVATION_DATE.isna()].head(3))

<font size="+1" color="red">Replace this cell with code to print the header and the first five rows after this processing</font>

In [None]:
display(df04.head(5))

<font size="+1" color="red">Replace this cell with code to print the number of rows of the original dataset, the number of rows of the new dataset, and the percentage of rows that were dropped, as well as the names of the columns that still contain NaN values, if any.</font>

In [None]:
r_orig = input_dataset.shape[0]
r_new = df04.shape[0]

print(f'Rows in the original dataset: {r_orig}')
print('Rows in the new dataset: {} ({:.2%} less)'.format(r_new, (r_orig-r_new)/r_orig))

pd.options.display.float_format = '{:.2%}'.format
display_NaN_Frequency(df04)

## 2.2. Distributions, outliers, and correlations

<font size="+1" color="red">Replace this cell with code to plot a histogram of **DEVICE\_VALUE** and **DURATION\_LINE**. Remember to include a title, and labels on the x axis and y axis</font>

<font size="+1" color="red">Include after each histogram a markdown cell where you indicate if you recognize any specific distribution (normal, exponential, uniform, ...) or any characteristic of the distribution (unimodal, bimodal).</font>

In [None]:
ax = plt.subplot()

ax.set(title='DEVICE_VALUE Histogram', xlabel='DEVICE_VALUE (USD)', ylabel='Count')
sns.histplot(df04.DEVICE_VALUE, kde=False, ax=ax, bins=50)

plt.tight_layout()
plt.show()

The device value seems to decrease roughly exponentially between $0 and about $2000; however, there is a sizeable group of devices that looks approximately normally distributed around $3000. This suggests a bimodal distribution, although I am not fully sure it qualifies as one.

In [None]:
ax = plt.subplot()

ax.set(title='DURATION_LINE Histogram', xlabel='DURATION_LINE (months)', ylabel='Count')
sns.histplot(df04.DURATION_LINE, kde=False, ax=ax, bins=30)

plt.tight_layout()
plt.show()

The DURATION_LINE histogram seems to decrease exponentially and also looks bimodal (peaks at around 20 and 50, with a valley at around 40).

<font size="+1" color="red">Replace this cell with a series of cells with code to plot a histogram comparing **VOICE\_TRAFFIC\_MONTH\_1** against **VOICE\_TRAFFIC\_MONTH\_6**, and **BILLING\_MONTH\_1** against **BILLING\_MONTH\_6**. Remember to include a title, labels on the x axis and y axis, and a legend.</font>

<font size="+1" color="red">Both plots should use logarithmic scale on the y axis</font>

<font size="+1" color="red">Include after both histograms your comment on the differences between month 1 and month 6.</font>

In [None]:
ax = plt.subplot()

ax.set(title='VOICE_TRAFFIC_MONTH Comparison', xlabel='Minutes of Traffic', ylabel='Count')
sns.histplot(df04.VOICE_TRAFFIC_MONTH_1, kde=False, ax=ax, binwidth=100, label='Month -1')
sns.histplot(df04.VOICE_TRAFFIC_MONTH_6, kde=False, ax=ax, binwidth=100, color='orange', label='Month -6')

plt.legend()
plt.yscale('log')
plt.tight_layout()
plt.show()

Both distributions seem to decrease exponentially and are unimodal; however, voice traffic appears to grow over time. In other words, the closer to the device purchase, the more voice traffic.

In [None]:
ax = plt.subplot()

ax.set(title='BILLING_MONTH Comparison', xlabel='Billing amount (USD)', ylabel='Count')
# sns.histplot(df04.BILLING_MONTH_1, kde=False, ax=ax, binwidth=50, label='Month -1')
# sns.histplot(df04.BILLING_MONTH_6, kde=False, ax=ax, binwidth=50, color='orange', label='Month -6')

# explicit bin edges keep both histograms on the same bins (the plot is a mess otherwise)
sns.histplot(df04.BILLING_MONTH_1, kde=False, ax=ax, bins=range(-150, 1000, 50), label='Month -1')
sns.histplot(df04.BILLING_MONTH_6, kde=False, ax=ax, bins=range(-150, 1000, 50), color='orange', label='Month -6')

plt.legend()
plt.yscale('log')
plt.tight_layout()
plt.show()

Both distributions appear unimodal and right-skewed (similar in shape to an F-distribution), since the right tail is considerably longer than the left one. In my opinion there is no noticeable difference between the two billing periods. The only remark that could matter is that, in the month before the purchase of a mobile phone, some clients had negative charges (money returned), which could influence the decision to buy a new device.

<font size="+1" color="red">Replace this cell with code to apply **log(x+1)** to **VOICE\_TRAFFIC\_MONTH\_1** and plot its new distribution.</font>

In [None]:
voice_transformed = np.log(df04.VOICE_TRAFFIC_MONTH_1 + 1)

ax = plt.subplot()

ax.set(title='log(VOICE_TRAFFIC_MONTH_1 + 1) Histogram', xlabel='log(minutes of voice traffic + 1)', ylabel='Count')
sns.histplot(voice_transformed, kde=False, ax=ax, bins=20)

plt.tight_layout()
plt.show()

We can see that the x-axis has now shrunk to a much more manageable range. The extreme values (potential outliers) are compressed into the right-most bins. A byproduct of this transformation is that all zero values (including the NaN values imputed as 0 earlier) are mapped to log(1) = 0, which creates a peak in the first bin.
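
The effect of the transformation can be seen on a handful of numbers. This is a small sketch with made-up minute values (not from the dataset); it uses `np.log1p`, which computes `log(x + 1)`.

```python
import numpy as np

# Hypothetical monthly voice minutes, including zeros and one extreme value
minutes = np.array([0.0, 0.0, 9.0, 99.0, 9999.0])

# np.log1p(x) computes log(x + 1) and is numerically accurate for x close to 0
transformed = np.log1p(minutes)

original_spread = minutes.max() - minutes.min()
transformed_spread = transformed.max() - transformed.min()

print(transformed)
print(original_spread, transformed_spread)
```

A value of exactly 0 minutes stays at 0 after the transformation, while the extreme value of 9999 ends up below 10, which is why the long right tail is compressed.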

<font size="+1" color="red">Replace this cell with code to create three boxplots, each of them for one of the variables **DATA\_TRAFFIC\_MONTH\_6**, **VOICE\_TRAFFIC\_MONTH\_6** and **BILLING\_MONTH\_6**. Remember to include a title and a label for the y axis.</font>

In [None]:
fig, axs = plt.subplots(1, 3)

fig.suptitle('Boxplots')

axs[0].set(ylabel='Mbps')
axs[1].set(ylabel='Minutes')
axs[2].set(ylabel='USD')

df04.boxplot('DATA_TRAFFIC_MONTH_6', ax=axs[0])
df04.boxplot('VOICE_TRAFFIC_MONTH_6', ax=axs[1])
df04.boxplot('BILLING_MONTH_6', ax=axs[2])

plt.tight_layout()
plt.show()

<font size="+1" color="red">Replace this cell with a brief commentary indicating which extreme values would you use as threshold for **outliers** in these variables, by looking at these box plots</font>

There seem to be many outliers in all three cases, all of them on the upper side. I would treat as outliers all data points above 60000 Mbps for DATA_TRAFFIC_MONTH_6, above 1000 minutes for VOICE_TRAFFIC_MONTH_6, and above roughly $600 for BILLING_MONTH_6.
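
As a complement to reading thresholds off the plot, the upper whisker of a box plot can be computed directly. The sketch below assumes the default whisker rule of Q3 + 1.5 × IQR; the helper `upper_whisker_limit` and the toy values are hypothetical and only illustrate the rule, they do not reproduce the thresholds chosen above.

```python
import pandas as pd

def upper_whisker_limit(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the usual upper fence used by box plots."""
    q1, q3 = values.quantile([0.25, 0.75])
    return q3 + k * (q3 - q1)

# Hypothetical toy values with one clearly extreme point
toy = pd.Series([0.0, 10.0, 20.0, 30.0, 40.0, 1000.0])

limit = upper_whisker_limit(toy)
outliers = toy[toy > limit]

print(limit)
print(outliers.tolist())
```

With heavily skewed variables such as traffic or billing, this rule flags a large number of points, so a visually chosen threshold such as the ones above can be a reasonable alternative.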

<font size="+1" color="red">Replace this cell with code to calculate the correlation between all traffic attributes (i.e., voice and data), duration line, billing, device cost and device value. Display the result as a table with rows and columns corresponding to columns, and cells indicating correlations. Display the result as an image using ``matshow``</font>

In [None]:
corr = df04.corr(method='pearson', numeric_only=True)
display(corr)  # the exercise also asks for the correlations as a table

fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111)
cax = ax.matshow(corr)
cbar = fig.colorbar(cax)

ax.xaxis.set_tick_params(which='both', top=False, labeltop=False, bottom=True, labelbottom=True)
# label the ticks with the columns actually present in corr, so labels and cells always match
ax.set_xticks(range(corr.shape[0]), corr.columns.to_list(), rotation=-55, ha='left')
ax.set_yticks(range(corr.shape[0]), corr.index.to_list())

fig.suptitle('Correlation Matrix')
plt.show()

<font size="+1" color="red">Replace this cell with a brief commentary on the results. Is the billing more correlated, in general, with the data traffic or with the voice traffic?</font>

This plot shows several interesting correlations. First, all variables with a month index (1-6) are strongly positively correlated with each other (the bright blocks near the diagonal). There is also a strong negative correlation between MONTHS_LAST_DEVICE and LAST_DEVICE_CHANGE (the earlier the purchase date, the older the phone), and likewise between DURATION_LINE and LINE_ACTIVATION_DATE (for the same reason).

Billing seems, in general, more correlated with data traffic than with voice traffic (the 1-6 block is brighter for data traffic than for voice traffic).
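
Comparing colours by eye can be backed up with a number. The following sketch averages the correlations inside a block of the matrix; the helper `mean_block_correlation` is hypothetical, and the data is synthetic, built so that billing depends mostly on data traffic, so it illustrates the method rather than the real result.

```python
import numpy as np
import pandas as pd

def mean_block_correlation(corr, row_prefix, col_prefix):
    """Average the correlations between two groups of columns chosen by name prefix."""
    rows = [name for name in corr.index if name.startswith(row_prefix)]
    cols = [name for name in corr.columns if name.startswith(col_prefix)]
    return corr.loc[rows, cols].to_numpy().mean()

# Synthetic data (an assumption for illustration): billing is driven mainly by data traffic
rng = np.random.default_rng(0)
n = 500
columns = {}
for month in (1, 2):
    data = rng.gamma(2.0, 100.0, size=n)
    voice = rng.gamma(2.0, 50.0, size=n)
    columns[f'DATA_TRAFFIC_MONTH_{month}'] = data
    columns[f'VOICE_TRAFFIC_MONTH_{month}'] = voice
    columns[f'BILLING_MONTH_{month}'] = 0.5 * data + 0.05 * voice + rng.normal(0.0, 5.0, size=n)

toy_corr = pd.DataFrame(columns).corr(method='pearson')

billing_vs_data = mean_block_correlation(toy_corr, 'BILLING_MONTH_', 'DATA_TRAFFIC_MONTH_')
billing_vs_voice = mean_block_correlation(toy_corr, 'BILLING_MONTH_', 'VOICE_TRAFFIC_MONTH_')

print(f'billing vs data:  {billing_vs_data:.2f}')
print(f'billing vs voice: {billing_vs_voice:.2f}')
```

Applied to the notebook, the same helper could be called on `corr` with the prefixes `'BILLING_MONTH_'`, `'DATA_TRAFFIC_MONTH_'` and `'VOICE_TRAFFIC_MONTH_'`.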

## 2.3. Date management and period calculation

<font size="+1" color="red">Replace this cell with code to create and print `latest_change` and `now`.</font>

In [None]:
latest_change = datetime.datetime.strptime(str(int(df04['LAST_DEVICE_CHANGE'].max())), '%Y%m%d')
now = latest_change + datetime.timedelta(30)
print(latest_change)
print(now)

<font size="+1" color="red">Replace this cell with code that replaces the **MONTHS_LAST_DEVICE** column to be equal to the difference, in periods of 30 days, between **LAST_DEVICE_CHANGE** and the `now` variable.</font>

In [83]:
series_converted = pd.to_datetime(df04['LAST_DEVICE_CHANGE'], format='%Y%m%d')
series_converted = (now - series_converted) / (30 * datetime.timedelta(days=1))

df05 = df04.copy()
df05['MONTHS_LAST_DEVICE'] = series_converted
df05.fillna({'MONTHS_LAST_DEVICE': 0}, inplace=True)
df05 = df05.astype({'MONTHS_LAST_DEVICE': 'int'})

<font size="+1" color="red">Replace this cell with code to update the **DURATION_LINE** value to be the difference, in days, between **LINE_ACTIVATION_DATE** and the `now` variable. Indicate the average of **DURATION_LINE** -- what is that in years, approximately?</font>

In [None]:
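activation_dates = pd.to_datetime(df05['LINE_ACTIVATION_DATE'], format='%Y%m%d')

df06 = df05.copy()
# store the duration as a number of days, as the exercise asks (not as a Timedelta)
df06['DURATION_LINE'] = (now - activation_dates).dt.days

mean_days = df06['DURATION_LINE'].mean()
# double quotes outside: nesting the same quote type in an f-string fails before Python 3.12
print(f"The average line duration is about {mean_days:.0f} days ({mean_days / 365:.2f} years)")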


## 2.4. Standardization and scaling of numerical variables

<font size="+1" color="red">Replace this cell with code to standardize and min-max scale the **DATA_TRAFFIC_MONTH_1**, **VOICE_TRAFFIC_MONTH_1**, **BILLING_MONTH_1** and **DEVICE_COST_MONTH_1** columns. Save the results in new columns with the same name followed by **_STANDARD** and **_MINMAX** (e.g., DATA\_TRAFFIC\_MONTH\_1\_STANDARD, DATA\_TRAFFIC\_MONTH\_1\_MINMAX). Plot a histogram for each new variable.</font>

In [85]:
df07 = df06.copy()

# the exercise asks for all four month-1 columns, including DATA_TRAFFIC_MONTH_1
month1_columns = ['DATA_TRAFFIC_MONTH_1', 'VOICE_TRAFFIC_MONTH_1', 'BILLING_MONTH_1', 'DEVICE_COST_MONTH_1']

for col in month1_columns:
    # fit_transform returns an (n, 1) array; ravel() flattens it into a single column
    df07[col + '_STANDARD'] = StandardScaler().fit_transform(df07[[col]]).ravel()
    df07[col + '_MINMAX'] = MinMaxScaler().fit_transform(df07[[col]]).ravel()

In [None]:
fig, axs = plt.subplots(2, 4, figsize=(18, 10))
axs = axs.flatten()

# first row: standardized variables, second row: min-max scaled variables
selection = [col + '_STANDARD' for col in month1_columns] + [col + '_MINMAX' for col in month1_columns]

for ax, sel in zip(axs, selection):
    sns.histplot(df07[sel], kde=False, ax=ax, bins=40)
    ax.set(title=sel, ylabel='Count')

fig.suptitle('Histograms of standardized and min-max scaled variables')
plt.tight_layout()
plt.show()

## 2.5. Convert categorical columns to dummy binary variables

<font size="+1" color="red">Create variable **PREVIOUS_DEVICE_BRAND_INT_ENCODED** containing an integer encoding of variable **PREVIOUS_DEVICE_BRAND**.</font>

In [87]:
df08 = df07.copy()
df08['PREVIOUS_DEVICE_BRAND_INT_ENCODED'] = LabelEncoder().fit_transform(df08['PREVIOUS_DEVICE_BRAND'])

<font size="+1" color="red">Replace this cell with code to convert **PREVIOUS_DEVICE_MANUF** to dummy binary variables.</font>

In [88]:
df08_dummies = df08.join(pd.get_dummies(df08['PREVIOUS_DEVICE_MANUF'], prefix='manuf'))

## 2.6. Feature generation

<font size="+1" color="red">Replace this cell with code to create from the 6 months of **DATA_TRAFFIC\_MONTH\_[1-6]**, **VOICE_TRAFFIC\_MONTH\_[1-6]**, **BILLING\_MONTH\_[1-6]** and **DEVICE_COST\_MONTH\_[1-6]**, new columns with the mean, maximum, minimum, range (i.e., difference between maximum and minimum) for each element. For instance, column **DATA_TRAFFIC_MEAN** should contain the average of these six numbers: **DATA_TRAFFIC_MONTH_1**, **DATA_TRAFFIC_MONTH_2**, ..., **DATA_TRAFFIC_MONTH_6**.</font>

In [89]:
df09 = df08_dummies.copy()

# same new columns as before (e.g. DATA_TRAFFIC_MONTH_MEAN), built in a loop to avoid repetition
for prefix in ['DATA_TRAFFIC_MONTH_', 'VOICE_TRAFFIC_MONTH_', 'BILLING_MONTH_', 'DEVICE_COST_MONTH_']:
    monthly = df09[[prefix + str(i) for i in range(1, 7)]]
    df09[prefix + 'MEAN'] = monthly.mean(axis=1)
    df09[prefix + 'MAX'] = monthly.max(axis=1)
    df09[prefix + 'MIN'] = monthly.min(axis=1)
    df09[prefix + 'RANGE'] = df09[prefix + 'MAX'] - df09[prefix + 'MIN']

<font size="+1" color="red">Replace this cell with code to create an additional column **DEVICE_COST_TO_BILLING_RATIO** containing the ratio between **DEVICE_COST_MEAN** and **BILLING_MEAN** and plot its distribution.</font>

In [None]:
df10 = df09.copy()

df10['DEVICE_COST_TO_BILLING_RATIO'] = df10['DEVICE_COST_MONTH_MEAN'] / df10['BILLING_MONTH_MEAN']

ax = plt.subplot()

ax.set(title='DEVICE_COST_TO_BILLING_RATIO Histogram', xlabel='', ylabel='Count')
sns.histplot(df10['DEVICE_COST_TO_BILLING_RATIO'], kde=False, ax=ax, binwidth=0.5)

plt.tight_layout()
plt.show()

<font size="+1" color="red">Replace this cell with a brief commentary on the distribution of the variable **DEVICE_COST_TO_BILLING_RATIO**. Can you recognize its distribution?</font>

This looks like an exponentially decreasing distribution with a very long right tail, which makes the histogram hard to read. Plotting the histogram of `log(x+1)` would probably give a better visualization. The plot tells us that, for most entries in the dataset, the device cost is small relative to the billing amount (either the device cost is very low or the billing amount is very high in comparison).
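
A sketch of the suggested `log(x+1)` view is shown below on made-up numbers. It also handles one detail that a ratio introduces: a billing mean of 0 produces an infinite ratio, which is replaced by NaN here. This treatment of infinities is an assumption of the sketch, not something the exercise prescribes, and it presumes the remaining ratios are non-negative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; the third row has a billing mean of 0
toy = pd.DataFrame({
    'DEVICE_COST_MONTH_MEAN': [0.0, 5.0, 10.0, 20.0],
    'BILLING_MONTH_MEAN': [40.0, 50.0, 0.0, 10.0],
})

ratio = toy['DEVICE_COST_MONTH_MEAN'] / toy['BILLING_MONTH_MEAN']

# Division by zero yields inf in pandas; mark those rows as missing instead
ratio = ratio.replace([np.inf, -np.inf], np.nan)

# log(x + 1) compresses the long right tail (assumes non-negative ratios)
log_ratio = np.log1p(ratio.dropna())

print(ratio.tolist())
print(log_ratio.tolist())
```

The transformed series could then be passed to `sns.histplot` in the same way as the original ratio.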

## 2.7. Text parsing/processing

<font size="+1" color="red">Replace this cell with code to use the **PURCHASED_DEVICE** variable to create 3 new columns with the following variables names: **PURCHASED_DEVICE_CODE**, **PURCHASED_DEVICE_MANUFACTURER** and **PURCHASED_DEVICE_MODEL**.</font>

In [None]:
df11 = df10.copy()
pd.set_option('display.max_colwidth', None)
display(df11['PURCHASED_DEVICE'].head(5))

In [None]:
# display(df11['PURCHASED_DEVICE'].str.extract(r'(.*)_(.*)\*\* \*\*(.*)', expand=True).head(5))
tmp = df11['PURCHASED_DEVICE'].str.split('_| ', n=2, expand=True) \
        .rename(columns={0: 'PURCHASED_DEVICE_CODE', 1: 'PURCHASED_DEVICE_MANUFACTURER', 2: 'PURCHASED_DEVICE_MODEL'})

# avoid doing the join operation multiple times
if 'PURCHASED_DEVICE_CODE' not in df11.columns:
    df11 = df11.join(tmp)

display(df11[['PURCHASED_DEVICE_CODE', 'PURCHASED_DEVICE_MANUFACTURER', 'PURCHASED_DEVICE_MODEL']])

<font size="+1" color="red">Replace this cell with code to create two tables: one with the number of devices per manufacturer in **PURCHASED_DEVICE_MANUFACTURER** and one with the number of devices per manufacturer in **PREVIOUS_DEVICE_MANUF**.</font>

In [None]:
table1 = df11['PURCHASED_DEVICE_MANUFACTURER'].value_counts().reset_index()
table2 = df11['PREVIOUS_DEVICE_MANUF'].value_counts().reset_index()

display(table1)
display(table2)

## 2.8. Splitting and sampling a dataset

<font size="+1" color="red">Replace this cell with code to split the dataset in two separate datasets: one with 70% of the rows and the other with 30% of rows</font>

In [94]:
seed = 1
# fractions are needed here: integer sizes would be read as absolute row counts (70 and 30 rows)
train, test = train_test_split(df11, train_size=0.7, test_size=0.3, random_state=seed)

<font size="+1" color="red">Replace this cell with code to compute the main statistics (mean, standard deviation, min, max, 25%, 50%, 75%) for the variables **DATA_TRAFFIC_MONTH_1**, **VOICE_TRAFFIC_MONTH_1** and **BILLING_MONTH_1** in both training and testing parts of the dataset.</font>

In [None]:
pd.options.display.float_format = '{:.2f}'.format

print('Train split:')
display(train[['DATA_TRAFFIC_MONTH_1', 'VOICE_TRAFFIC_MONTH_1', 'BILLING_MONTH_1']].describe())

print('Test split:')
display(test[['DATA_TRAFFIC_MONTH_1', 'VOICE_TRAFFIC_MONTH_1', 'BILLING_MONTH_1']].describe())

<font size="+1" color="red">Replace this cell with a brief commentary indicating if you find these statistics match between the two splits, or do not match between them.</font>

They match reasonably well for most statistics (since the split is random, some difference between the two parts is expected). However, there are noticeable differences in the statistics where outliers play a big role, such as `min` or `max`.
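
To make the comparison less dependent on reading two tables side by side, the statistics of both parts can be placed in one table together with their relative difference. This is a minimal sketch on synthetic, exponentially distributed billing values (an assumption for illustration); the column name `relative_diff` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, right-skewed toy data standing in for a billing column
rng = np.random.default_rng(1)
toy = pd.DataFrame({'BILLING_MONTH_1': rng.exponential(scale=50.0, size=2000)})

# A float test_size is interpreted as a fraction of the rows
train, test = train_test_split(toy, test_size=0.3, random_state=1)

# One column per split, one row per statistic
comparison = pd.concat(
    {'train': train['BILLING_MONTH_1'].describe(),
     'test': test['BILLING_MONTH_1'].describe()},
    axis=1)
comparison['relative_diff'] = (comparison['train'] - comparison['test']).abs() / comparison['test'].abs()

print(comparison)
```

Statistics driven by single extreme rows, such as `max`, tend to show the largest relative differences, since an extreme row can only land in one of the two parts. The `count` row differs by design because the two parts have different sizes.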

# 3. Comparing iPhone and Samsung J series users

<font size="+1" color="red">Replace this cell with code to create two dataframes: one with all the attributes of Apple iPhone users and one with all the attributes of Samsung J series users.</font>

In [175]:
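# NOTE: these filters select all Apple and all Samsung purchases; restricting them to
# iPhone and J series models would need an extra filter on PURCHASED_DEVICE_MODEL
apple_df = df11[df11.PURCHASED_DEVICE_MANUFACTURER == 'APPLE']
samsung_df = df11[df11.PURCHASED_DEVICE_MANUFACTURER == 'SAMSUNG']

# display(apple_df)
# display(samsung_df)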

<font size="+1" color="red">Replace this cell with code to compare some variables between the two datasets. Consider 2 or 3 variables, plot together the histograms of each variable in both datasets (including a legend).</font>

In [130]:
# first we want to choose relevant features, to visualize them we can do the following
# import itertools as it
# ignore = ['DATA_TRAFFIC_MONTH_', 'VOICE_TRAFFIC_MONTH_', 'BILLING_MONTH_', 'DEVICE_COST_MONTH_']
# exclude = [excl + str(i) for excl, i in it.product(ignore, range(1,7))] # equivalent to nested for loop

# display(apple_df[list(set(apple_df.columns) - set(exclude))].describe())
# display(samsung_df[list(set(apple_df.columns) - set(exclude))].describe())

In [None]:
# the ones that seem more interesting are DEVICE_VALUE, DATA_TRAFFIC_MONTH_MEAN, DATA_TRAFFIC_MONTH_RANGE
selection = ['DEVICE_VALUE', 'DATA_TRAFFIC_MONTH_MEAN', 'DATA_TRAFFIC_MONTH_RANGE']
display(apple_df[selection].describe())
display(samsung_df[selection].describe())

In [None]:
fig, axs = plt.subplots(2,3, figsize=(14,10), sharex='col')

for i, sel in enumerate(selection):
    sns.histplot(apple_df[sel], kde=False, ax=axs[0][i], bins=40)
    sns.histplot(samsung_df[sel], kde=False, ax=axs[1][i], bins=40)

axs[0][0].annotate('Apple Users', xy=(0, 0.5), xytext=(-axs[0][0].yaxis.labelpad - 5, 0),
                xycoords=axs[0][0].yaxis.label, textcoords='offset points',
                size='large', ha='right', va='center')

axs[1][0].annotate('Samsung Users', xy=(0, 0.5), xytext=(-axs[1][0].yaxis.labelpad - 5, 0),
                xycoords=axs[1][0].yaxis.label, textcoords='offset points',
                size='large', ha='right', va='center')

fig.suptitle('Histograms of selected features separately')
fig.tight_layout()
plt.show()

In [None]:
ax = plt.subplot()

ax.set(title='DEVICE_VALUE Comparison', xlabel='Value (USD)', ylabel='Count (Normalized)') # normalized as if it was a probability

# very ugly but it is a mess otherwise
sns.histplot(apple_df['DEVICE_VALUE'], kde=False, ax=ax, label='Apple', stat='probability', bins=range(0, 9100, 100))
sns.histplot(samsung_df['DEVICE_VALUE'], kde=False, ax=ax, label='Samsung', stat='probability', bins=range(0, 9100, 100), color='orange')

plt.legend()
plt.tight_layout()
plt.show()

In [None]:
ax = plt.subplot()

ax.set(title='DATA_TRAFFIC_MONTH_MEAN Comparison', xlabel='Log Data Traffic (Mbps)', ylabel='Count (Normalized)')

sns.histplot(np.log(apple_df['DATA_TRAFFIC_MONTH_MEAN']+1), kde=False, ax=ax, label='Apple', stat='probability', bins=range(0, 12, 1))
sns.histplot(np.log(samsung_df['DATA_TRAFFIC_MONTH_MEAN']+1), kde=False, ax=ax, label='Samsung', stat='probability', bins=range(0, 12, 1), color='orange')

plt.legend()
plt.tight_layout()
plt.show()

In [None]:
ax = plt.subplot()

ax.set(title='DATA_TRAFFIC_MONTH_RANGE Comparison', xlabel='Log Data Traffic Ranges (Mbps)', ylabel='Count (Normalized)')

sns.histplot(np.log(apple_df['DATA_TRAFFIC_MONTH_RANGE']+1), kde=False, ax=ax, label='Apple', stat='probability', bins=range(0, 13, 1))
sns.histplot(np.log(samsung_df['DATA_TRAFFIC_MONTH_RANGE']+1), kde=False, ax=ax, label='Samsung', stat='probability', bins=range(0, 13, 1), color='orange')

plt.legend()
plt.tight_layout()
plt.show()

<font size="+1" color="red">Replace this cell with a brief commentary on the differences you found between these two groups of users.</font>

For the first selected feature, the device value, we can see quite clearly that Samsung phones follow (approximately) a bimodal, roughly normal distribution and tend to be much cheaper than Apple phones, whose distribution has a much higher mean and a heavier right tail.

The second and third features are related: both describe the data traffic of the six months prior to the device purchase, which gives an idea of the typical phone usage of each population. The DATA_TRAFFIC_MONTH_MEAN plot shows that Samsung users generally have less data traffic, while most of the Apple user base has more usage on average (the right tail of the Apple distribution carries significantly more weight than its Samsung counterpart). In addition, the DATA_TRAFFIC_MONTH_RANGE plot shows that the traffic ranges also differ noticeably: Samsung users have a more consistent data usage overall, while Apple users have a wider range between their lowest and highest months.

To conclude this second analysis, Samsung users generally have lower data usage but are more consistent with it, while Apple users generally have higher data usage that can vary a lot from month to month.
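
The visual comparison can be summarized with per-group medians, which are less sensitive to the long right tails than means. The sketch below uses made-up rows (not the real Apple and Samsung subsets) purely to show the computation.

```python
import pandas as pd

# Hypothetical toy rows; the traffic values are invented for illustration
toy = pd.DataFrame({
    'PURCHASED_DEVICE_MANUFACTURER': ['APPLE', 'APPLE', 'APPLE', 'SAMSUNG', 'SAMSUNG', 'SAMSUNG'],
    'DATA_TRAFFIC_MONTH_MEAN': [900.0, 1500.0, 3000.0, 200.0, 400.0, 900.0],
    'DATA_TRAFFIC_MONTH_RANGE': [800.0, 2000.0, 4500.0, 100.0, 300.0, 600.0],
})

# Median of each feature per manufacturer
medians = toy.groupby('PURCHASED_DEVICE_MANUFACTURER')[
    ['DATA_TRAFFIC_MONTH_MEAN', 'DATA_TRAFFIC_MONTH_RANGE']].median()

print(medians)
```

On the real data, the same `groupby` could be run on `df11` restricted to the two manufacturers of interest.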

<font size="+2" color="#003300">I hereby declare that, except for the code provided by the course instructors, all of my code, report, and figures were produced by myself.</font>