## Importing libraries

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

---

## Importing data

In [None]:
train_path = os.path.join('data', 'train.csv')
test_path = os.path.join('data', 'test.csv')

train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

---

## Data overview

### Training Data

In [None]:
train_data.shape

In [None]:
train_data.dtypes

Our possible features are:

**A_follower_count**: Number of followers of individual A

**A_following_count**: Number of accounts individual A follows

**A_listed_count**: Number of lists individual A is included in

**A_mentions_received**: Number of times individual A was mentioned

**A_retweets_received**: Number of retweets received by individual A

**A_mentions_sent**: Number of times individual A mentioned someone

**A_retweets_sent**: Number of retweets made by individual A

**A_posts**: Number of posts made by individual A

**A_network_feature_1**: Description of the local follower network of individual A

**A_network_feature_2**: Description of the local follower network of individual A

**A_network_feature_3**: Description of the local follower network of individual A

**B_follower_count**: Number of followers of individual B

**B_following_count**: Number of accounts individual B follows

**B_listed_count**: Number of lists individual B is included in

**B_mentions_received**: Number of times individual B was mentioned

**B_retweets_received**: Number of retweets received by individual B

**B_mentions_sent**: Number of times individual B mentioned someone

**B_retweets_sent**: Number of retweets made by individual B

**B_posts**: Number of posts made by individual B

**B_network_feature_1**: Description of the local follower network of individual B

**B_network_feature_2**: Description of the local follower network of individual B

**B_network_feature_3**: Description of the local follower network of individual B
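
The features come in A/B pairs that share a suffix, which the later cells rely on when slicing columns by position. A minimal sketch of how that pairing could be checked, using a hypothetical four-column subset of the names listed above rather than the real data:

```python
# Hypothetical subset of the column names listed above (illustration only).
columns = ['A_follower_count', 'A_posts', 'B_follower_count', 'B_posts']

# Strip the two-character 'A_' / 'B_' prefix to compare the shared suffixes.
a_suffixes = [c[2:] for c in columns if c.startswith('A_')]
b_suffixes = [c[2:] for c in columns if c.startswith('B_')]

# True when every A_ column has a B_ counterpart in the same order.
columns_are_paired = a_suffixes == b_suffixes
print(columns_are_paired)
```

On the real data, `columns` would presumably be `list(train_data.columns)` with the label column removed.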

In [None]:
train_data.head()

### Testing Data

In [None]:
test_data.shape

In [None]:
test_data.dtypes

In [None]:
test_data.head()

## Separating data: features and labels

To simplify the analysis, we will separate the data into:

**x_train**: (possible) training features.

**y_train**: labels.

**x_test**: (possible) test features.

We will call __features__ the list of column names that might be used as features and __label__ the name of the column that holds the labels (the **Choice** column).

In [None]:
features = list(test_data.columns)
label = 'Choice'

In [None]:
x_train = train_data[features]
y_train = train_data[label]
x_test = test_data[features]

## Training Data Analysis

In [None]:
y_train.head()

In [None]:
x_train.info()

In [None]:
A_features = x_train.columns[0:11]
B_features = x_train.columns[11:22]

A_train = x_train[A_features].copy()
B_train = x_train[B_features].copy()

AB_features = [s[2:] for s in list(A_features)]
A_rename_cols_dict = dict(zip(A_features, AB_features))
B_rename_cols_dict = dict(zip(B_features, AB_features))

A_train.rename(columns = A_rename_cols_dict, inplace = True)
B_train.rename(columns = B_rename_cols_dict, inplace = True)

AB_train = pd.concat([A_train, B_train], axis = 0)

In [None]:
AB_train.head()

In [None]:
x_train.head()

### Univariate Analysis

#### Choice

In [None]:
y_train.describe()

In [None]:
y_train.value_counts()

In [None]:
counts = y_train.value_counts()
A_more_influential = counts[1]  # Choice == 1: A is more influential
B_more_influential = counts[0]  # Choice == 0: B is more influential
# Pass the heights explicitly: value_counts() is sorted by frequency, not by label
plt.bar(x = ['A', 'B'], height = [A_more_influential, B_more_influential], color = ['darkblue', 'gray'])
plt.text(0 - 0.06, A_more_influential + 4, f'{A_more_influential}')
plt.text(1 - 0.06, B_more_influential + 4, f'{B_more_influential}')
plt.yticks([])
plt.suptitle('Who is more influential between A and B')
plt.title('Influencers are approximately equally distributed between A and B')

plt.show()

#### Follower Count

In [None]:
follower_count = AB_train['follower_count']
follower_count.describe()

In [None]:
follower_count.isnull().value_counts()

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))

axes[0].hist(follower_count, bins=np.logspace(0, 8, 9))
axes[0].set_xscale('log')
axes[0].set_title('Follower Count Histogram of Training Set')
axes[0].set_xlabel('Follower Count')
axes[0].set_ylabel('Count')

axes[1].boxplot(follower_count)
axes[1].set_yscale('log')
axes[1].set_title("Boxplot of Follower Count in the training set")
axes[1].set_ylabel('Follower Count')
axes[1].set_xticklabels([])

#### Following Count

In [None]:
following_count = AB_train['following_count']
following_count.describe()

In [None]:
following_count.isnull().value_counts()

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))

axes[0].hist(following_count, bins=np.logspace(0, 6, 7))
axes[0].set_xscale('log')
axes[0].set_title('Following Count Histogram of Training Set')
axes[0].set_xlabel('Following Count')
axes[0].set_ylabel('Count')

axes[1].boxplot(following_count)
axes[1].set_yscale('log')
axes[1].set_title("Boxplot of Following Count in the training set")
axes[1].set_ylabel('Following Count')
axes[1].set_xticklabels([])

#### Listed Count

In [None]:
listed_count = AB_train['listed_count']
listed_count.describe()

In [None]:
listed_count.isnull().value_counts()

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))

axes[0].hist(listed_count, bins=np.logspace(-1, 6, 8))
axes[0].set_xscale('log')
axes[0].set_title('Listed Count Histogram of Training Set')
axes[0].set_xlabel('Listed Count')
axes[0].set_ylabel('Count')

axes[1].boxplot(listed_count)
axes[1].set_yscale('log')
axes[1].set_title("Boxplot of Listed Count in the training set")
axes[1].set_ylabel('Listed Count')
axes[1].set_xticklabels([])

#### Mentions Received

In [None]:
mentions_received = AB_train['mentions_received']
mentions_received.describe()

In [None]:
mentions_received.isnull().value_counts()

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))

axes[0].hist(mentions_received, bins=np.logspace(-1, 6, 8))
axes[0].set_xscale('log')
axes[0].set_title('Mentions Received Histogram of Training Set')
axes[0].set_xlabel('Mentions Received')
axes[0].set_ylabel('Count')

axes[1].boxplot(mentions_received)
axes[1].set_yscale('log')
axes[1].set_title("Boxplot of Mentions Received in the training set")
axes[1].set_ylabel('Mentions Received')
axes[1].set_xticklabels([])

#### Retweets Received

In [None]:
retweets_received = AB_train['retweets_received']
retweets_received.describe()

In [None]:
retweets_received.isnull().value_counts()

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))

axes[0].hist(retweets_received, bins=np.logspace(-1, 5, 7))
axes[0].set_xscale('log')
axes[0].set_title('Retweets Received Histogram of Training Set')
axes[0].set_xlabel('Retweets Received')
axes[0].set_ylabel('Count')

axes[1].boxplot(retweets_received)
axes[1].set_yscale('log')
axes[1].set_title("Boxplot of Retweets Received in the training set")
axes[1].set_ylabel('Retweets Received')
axes[1].set_xticklabels([])

#### Mentions Sent

In [None]:
mentions_sent = AB_train['mentions_sent']
mentions_sent.describe()

In [None]:
mentions_sent.isnull().value_counts()

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))

axes[0].hist(mentions_sent, bins=np.linspace(0, 80, 8))
axes[0].set_yscale('log')
axes[0].set_title('Mentions Sent Histogram of Training Set')
axes[0].set_xlabel('Mentions Sent')
axes[0].set_ylabel('Count')

axes[1].boxplot(mentions_sent)
axes[1].set_title("Boxplot of Mentions Sent in the training set")
axes[1].set_ylabel('Mentions Sent')
axes[1].set_xticklabels([])

#### Retweets Sent

In [None]:
retweets_sent = AB_train['retweets_sent']
retweets_sent.describe()

In [None]:
retweets_sent.isnull().value_counts()

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))

axes[0].hist(retweets_sent, bins=np.linspace(0, 20, 10))
axes[0].set_yscale('log')
axes[0].set_title('Retweets Sent Histogram of Training Set')
axes[0].set_xlabel('Retweets Sent')
axes[0].set_ylabel('Count')

axes[1].boxplot(retweets_sent)
axes[1].set_title("Boxplot of Retweets Sent in the training set")
axes[1].set_ylabel('Retweets Sent')
axes[1].set_xticklabels([])

#### Posts

In [None]:
posts = AB_train['posts']
posts.describe()

In [None]:
posts.isnull().value_counts()

In [None]:
fig, axes = plt.subplots(nrows = 1, ncols = 2, figsize = (10, 5))

axes[0].hist(posts, bins=np.logspace(-1, 3, 5))
axes[0].set_xscale('log')
axes[0].set_yscale('log')
axes[0].set_title('Posts Histogram of Training Set')
axes[0].set_xlabel('Posts')
axes[0].set_ylabel('Count')

axes[1].boxplot(posts)
axes[1].set_yscale('log')
axes[1].set_title("Boxplot of Posts in the training set")
axes[1].set_ylabel('Posts')
axes[1].set_xticklabels([])

#### Network Feature 1

In [None]:
network_feature_1 = AB_train['network_feature_1']
network_feature_1.describe()

In [None]:
network_feature_1.isnull().value_counts()

#### Network Feature 2

In [None]:
network_feature_2 = AB_train['network_feature_2']
network_feature_2.describe()

In [None]:
network_feature_2.isnull().value_counts()

#### Network Feature 3

In [None]:
network_feature_3 = AB_train['network_feature_3']
network_feature_3.describe()

In [None]:
network_feature_3.isnull().value_counts()

---

## Preprocessing Data for Machine Learning

In [None]:
train_data.corr()

In [None]:
y_train.head()

__y_train__ needs no preprocessing. Its labels are 0 or 1, while model predictions range from 0 to 1 and must be compared against a threshold to classify the result.
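
As an illustration of the thresholding step, here is a minimal sketch. The probabilities and the 0.5 cut-off are assumptions made for this example; the notebook does not fix a threshold.

```python
import numpy as np

# Hypothetical predicted probabilities that A is more influential (Choice = 1).
predicted_proba = np.array([0.10, 0.50, 0.73, 0.49])

# Assumed cut-off; not taken from the notebook.
threshold = 0.5

# Probabilities at or above the threshold become 1, the rest become 0.
predicted_label = (predicted_proba >= threshold).astype(int)

print(predicted_label.tolist())  # [0, 1, 1, 0]
```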

In [None]:
x_train.head()

We will apply the following preprocessing steps:

- Transform each A and B pair of features into a single __diffrat__ feature ((A - B)/(A + B))
- Replace NaN (produced when A + B is 0) with 0
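
The two steps can be sketched on a toy example before applying them to the full data. The counts below are made up for illustration; only the column naming follows the dataset.

```python
import pandas as pd

# Toy pair data; the values are invented for illustration only.
toy = pd.DataFrame({
    'A_follower_count': [300, 0, 50],
    'B_follower_count': [100, 0, 150],
})

a = toy['A_follower_count']
b = toy['B_follower_count']

# (A - B) / (A + B) lies in [-1, 1] for non-negative counts:
# positive when A has the larger count, negative when B does.
# A row where both counts are 0 gives 0/0 = NaN, which fillna replaces with 0.
diffrat = ((a - b) / (a + b)).fillna(0)

print(diffrat.tolist())  # [0.5, 0.0, -0.5]
```

Replacing NaN with 0 treats a pair where both individuals have a zero count as a tie on that feature.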

In [None]:
preprocessed_y_train = y_train.copy()
preprocessed_x_train = pd.DataFrame()
preprocessed_x_test = pd.DataFrame()

In [None]:
A_features = x_train.columns[0:11]
B_features = x_train.columns[11:22]

A_train = x_train[A_features].copy()
B_train = x_train[B_features].copy()
A_test = x_test[A_features].copy()
B_test = x_test[B_features].copy()

AB_features = [s[2:] for s in list(A_features)]

for feature in AB_features:
    preprocessed_x_train[feature] = (A_train['A_' + feature] - B_train['B_' + feature]) / (A_train['A_' + feature] + B_train['B_' + feature])
    preprocessed_x_test[feature] = (A_test['A_' + feature] - B_test['B_' + feature]) / (A_test['A_' + feature] + B_test['B_' + feature])

In [None]:
preprocessed_x_train = preprocessed_x_train.fillna(0)
preprocessed_x_test = preprocessed_x_test.fillna(0)

In [None]:
preprocessed_x_train.head()

### Saving the preprocessed data to .csv files

In [None]:
preprocessed_y_train_path = os.path.join('data', 'preprocessed_y_train.csv')
preprocessed_x_train_path = os.path.join('data', 'preprocessed_x_train.csv')
preprocessed_x_test_path = os.path.join('data', 'preprocessed_x_test.csv')

preprocessed_y_train.to_csv(preprocessed_y_train_path)
preprocessed_x_train.to_csv(preprocessed_x_train_path)
preprocessed_x_test.to_csv(preprocessed_x_test_path)

### Final Correlations

In [None]:
preprocessed_train_data = pd.concat([y_train, preprocessed_x_train], axis = 1)

In [None]:
preprocessed_train_data.corr()