<a href="https://colab.research.google.com/github/Sopralapanca/TwitterDataset-DM-Project/blob/develop/DM_understanding_task1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What will be done here...

Following the suggestion in the milestone description, this notebook aims to build a first understanding of the dataset: its size and how to handle it correctly. To do so, we read the content of the columns and convert them to the right type; finally, we present some plots that give a first indication of how the data are distributed.

A deeper analysis will be done after the cleaning and substitution of wrong rows in the next notebook **Data Preparation**.

Task 1.1: Data Understanding

Explore the dataset with the analytical tools studied and write a concise “data understanding”
report assessing data quality, the distribution of the variables and the pairwise correlations.
Subtasks of DU:

1. Data semantics for each feature that is not described above and the new one defined
by the team
2. Distribution of the variables and statistics
3. Assessing data quality (missing values, outliers, duplicated records, errors)
4. Variables transformations
5. Pairwise correlations and eventual elimination of redundant variables

# Import libraries and load the data

In [None]:
!pip install calmap

In [None]:
# Import libraries
import pandas as pd
from pandas import DataFrame

import numpy as np

import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

import math
import calendar
import calmap
import os

from os import path
from sys import getsizeof

In [None]:
tweet_path = "/data/tweets.csv"
user_path = "/data/users.csv"

# max_rows is used to load a portion of the dataset

max_rows = 0
 
users_df = pd.read_csv(user_path) 

if max_rows != 0:
  tweets_df = pd.read_csv(tweet_path, nrows=max_rows, encoding="UTF-8")
else:
  tweets_df = pd.read_csv(tweet_path, encoding="UTF-8")

In [None]:
# max_rows is used to load a portion of the dataset

max_rows = 0
 
users_df = pd.read_csv("./users.csv") 

if max_rows != 0:
  tweets_df = pd.read_csv("./tweets.csv", nrows=max_rows, encoding="UTF-8")
else:
  tweets_df = pd.read_csv("./tweets.csv", encoding="UTF-8")

# **Data Understanding**

---



## Data Semantics

From the project specifications we have:

USERS CSV

1. User Id: an incremental identifier for the user
2. Statuses Count: the number of tweets made by the user at the moment of data
crawling (it counts only the tweets)
3. Lang: the language selected by the user; slang variants derived from the country are also listed
4. Created at: the timestamp at which the profile was created; many dates are wrong
5. Label: a binary variable that indicates if a user is a bot or a genuine user

TWEETS CSV

1. ID: an incremental identifier for the tweet, reply or comment
2. User Id: a unique identifier for the user who wrote the tweet
3. Retweet count: number of retweets for the tweet in analysis
4. Reply count: number of replies for the tweet in analysis
5. Favorite count: number of likes received 
6. Num hashtags: number of hashtags used in the tweet
7. Num urls: number of urls in the tweet
8. Num mentions: number of mentions in the tweet
9. Created at: when the tweet was created, many are wrong
10. Text: the text of the tweet

#### tweets.csv information

In [None]:
tweets_df.info()

In [None]:
tweets_df.head(2)

#### users.csv information

In [None]:
users_df.info()

In [None]:
users_df.head(2)

## Assessing data quality

**Checking if there are any missing values and count them**

In [None]:
def nan_unique_count(df: DataFrame):
  print('| {:>15} | {:>15}| {:>15} |'.format(*["column", "unique values", "NaN" ]))
  print('------------------------------------------------------')
  for col in df.columns:
    print('| {:>15} | {:>15}| {:>15} |'.format(*[col, len(df[col].unique()), df[col].isna().sum() ]))

In [None]:
nan_unique_count(tweets_df)

In [None]:
nan_unique_count(users_df)

As shown above, there are some null values in the two dataframes. In addition, the pandas info method reports the type of each attribute. All the features in the tweets dataframe are of type "object", which means that non-numeric values are present in attributes that should be numbers, such as id, user_id and so on. The data therefore have to be cleaned and converted to the right type.
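
The unique/NaN summary printed above can also be returned as a DataFrame, which is easier to sort or filter. The following is a minimal sketch on toy data; the helper name `nan_unique_summary` and the toy frame are illustrative assumptions, not part of the project code.

```python
import numpy as np
import pandas as pd


def nan_unique_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its unique-value and NaN counts."""
    return pd.DataFrame({
        # dropna=False counts NaN as one distinct value, like len(col.unique())
        "unique values": df.nunique(dropna=False),
        "NaN": df.isna().sum(),
    })


# Toy data (hypothetical, not taken from the project dataset)
toy_df = pd.DataFrame({
    "lang": ["en", "it", None, "en"],
    "statuses_count": [10, np.nan, np.nan, 3],
})

summary = nan_unique_summary(toy_df)
print(summary)
```

On the real data the same helper could be called as `nan_unique_summary(tweets_df)` or `nan_unique_summary(users_df)`.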

**Anomalies on numeric fields**

In [None]:
# to check if non-numeric values are present in the dataset  
# we throw an exception when we try to convert the feature to the correct type.

tweets_cols = ["id", "user_id", "retweet_count", "reply_count",
           "favorite_count", "num_hashtags",  "num_urls", "num_mentions"]

user_cols = ["id", "statuses_count"]

# checking non-numeric values inside tweets df
for col in tweets_cols:
    try:
        pd.to_numeric(tweets_df[col], errors='raise')
    except Exception as e:
      print(f"column: {col} error: {e}")

# checking non-numeric values inside users df
for col in user_cols:
    try:
        pd.to_numeric(users_df[col], errors='raise')
    except Exception as e:
      print(f"column: {col} error: {e}")

In [None]:
# count the values that cannot be parsed as numbers (missing values included) in the tweets dataframe

for col in tweets_cols:
  mask = pd.to_numeric(tweets_df[col], errors='coerce').isna()
  a = mask.sum()

  print(f"column {col} has {a} non-numeric values")

del mask
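
The counts above include values that were already missing. To look at the offending entries themselves, one option is to keep only the values that are present but fail to parse. This is a sketch on toy data; `non_numeric_values` and the toy series are hypothetical names introduced for illustration.

```python
import pandas as pd


def non_numeric_values(series: pd.Series) -> pd.Series:
    """Return the non-null entries of `series` that cannot be parsed as numbers."""
    parsed = pd.to_numeric(series, errors="coerce")
    # Failed to parse AND originally present: excludes values that were already missing
    mask = parsed.isna() & series.notna()
    return series[mask]


# Toy column mixing valid counts, garbage strings and a missing value
toy_counts = pd.Series(["3", "12", "abc", None, "7x", "0"])

bad = non_numeric_values(toy_counts)
print(bad.tolist())
```

Applied to a real column, for example `non_numeric_values(tweets_df["retweet_count"]).value_counts()`, this would show which strings occur most often among the invalid entries.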

**Anomalies on datetime**

In [None]:
# checking correct datetime in tweets df and user df

try:
  pd.to_datetime(tweets_df["created_at"], errors='raise')
except Exception as e:
  print(e)

try:
  pd.to_datetime(users_df["created_at"], errors='raise')
except Exception as e:
  print(e)

No errors in the datetime format have been found.

**Anomalies on languages**

In [None]:
print(users_df["lang"].unique())

We can immediately notice erroneous values such as "Select Language..." or the repetition of "zh-tw/zh-TW". These values will be cleaned in the data cleaning section.

In [None]:
users_df.loc[users_df['lang'] == 'Select Language...']

In [None]:
tweets_df.loc[tweets_df['user_id'] == '2956613720'].head(2)

In [None]:
tweets_df.loc[tweets_df['user_id'] == '2904858613'].head(2)

In [None]:
users_df.loc[users_df['lang'] == 'xx-lc']

In [None]:
tweets_df.loc[tweets_df['user_id'] == '29552151'].head(2)

We can state that the erroneous languages belong to users who write tweets in English.
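
As an illustration of the kind of cleaning these anomalies call for, the sketch below lowercases the language codes (so that "zh-tw" and "zh-TW" collapse into one value) and marks the placeholder as missing. It is only one possible approach under stated assumptions: the helper `normalise_lang` and the `PLACEHOLDER_LANGS` set are hypothetical, and how other suspicious codes should be handled is left open.

```python
import pandas as pd

# Assumption: the UI placeholder carries no information and is treated as missing
PLACEHOLDER_LANGS = {"select language..."}


def normalise_lang(lang: pd.Series) -> pd.Series:
    """Lowercase and strip language codes; replace placeholders with <NA>."""
    cleaned = lang.astype("string").str.strip().str.lower()
    # mask() sets the entries where the condition holds to missing
    return cleaned.mask(cleaned.isin(PLACEHOLDER_LANGS))


# Toy values reproducing the anomalies noted above
toy_lang = pd.Series(["en", "zh-tw", "zh-TW", "Select Language...", "it"])

clean_lang = normalise_lang(toy_lang)
print(clean_lang.tolist())
```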

**Anomalies on user id**

In [None]:
# check if there are duplicated ids on users dataframe
users_df[users_df['id'].duplicated() & users_df['id'].notnull()]["id"]

No duplicated ids found
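
The same `duplicated` method can be used to look for duplicated records, either on whole rows or on a subset of columns. The sketch below uses a toy frame; the choice of `user_id` and `text` as the subset is an assumption made for illustration.

```python
import pandas as pd

# Toy tweets (hypothetical values)
toy_tweets = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "created_at": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-03"],
    "text": ["hello", "hello", "hi", "hi"],
})

# Rows that are identical to an earlier row in every column
full_dupes = toy_tweets.duplicated().sum()

# Rows that repeat an earlier (user_id, text) pair, whatever the date
subset_dupes = toy_tweets.duplicated(subset=["user_id", "text"]).sum()

print(full_dupes, subset_dupes)
```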

**Anomalies on bot label**

In [None]:
# check if the column is binary
print(users_df['bot'].isin([0,1]).all())

## Assigning the correct type to each attribute

In [None]:
tweets_ssize = getsizeof(tweets_df)/(1024.0**3)
user_ssize = getsizeof(users_df)/(1024.0**2)
print("Tweets Dataframe specifics : ------------- \n{} - size: {:.2f} GB\n".format(tweets_df.dtypes, tweets_ssize))
print("Users Dataframe specifics:------------- \n{} - size: {:.2f} MB".format(users_df.dtypes, user_ssize))

Converting the binary variable to boolean

In [None]:
users_df['bot'] = users_df['bot'].apply(lambda x: x==1)  

Assigning the appropriate type to the date columns

In [None]:
tweets_df["created_at"]=pd.to_datetime(tweets_df["created_at"]
                                       , errors='coerce', yearfirst=True)

users_df["created_at"]=pd.to_datetime(users_df["created_at"]
                                      , errors='coerce', yearfirst=True)

Convert the numeric columns to the smallest integer/float type that fits their values, and the text columns to string, in order to save memory. This process can be repeated after the outlier handling.

In [None]:
# If a value can't be converted to integer a NaN is inserted
# The NaN will be replaced later

numeric_columns = ["id", "user_id", "retweet_count", 
                   "reply_count", "favorite_count", "num_hashtags",  
                   "num_urls", "num_mentions"]

for col in numeric_columns:
    tweets_df[col] = pd.to_numeric(tweets_df[col], 
                                   errors='coerce', downcast='integer')

users_df['statuses_count'] = pd.to_numeric(users_df['statuses_count'], 
                                           errors='coerce', downcast='integer')
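
One detail of `downcast='integer'` is worth keeping in mind: as long as a column contains NaN (for example after coercion), pandas cannot store it in a plain integer type and leaves it as float64. The toy example below illustrates this; the conversion to the nullable `Int64` type at the end is shown only as one possible option, not as the approach used in this notebook.

```python
import pandas as pd

# All values parse: the result is downcast to the smallest integer type
clean = pd.to_numeric(pd.Series(["1", "2", "3"]),
                      errors="coerce", downcast="integer")

# One value fails to parse and becomes NaN: the column stays float64
dirty = pd.to_numeric(pd.Series(["1", "2", "abc"]),
                      errors="coerce", downcast="integer")

print(clean.dtype)
print(dirty.dtype)

# Optional: the nullable integer dtype keeps integers and missing values together
nullable = dirty.astype("Int64")
print(nullable.dtype)
```

This is why repeating the downcast after the NaN values have been replaced can save additional memory.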

Converting the text columns to string

In [None]:
tweets_df['text'] = tweets_df['text'].astype('string')

users_df['name'] = users_df['name'].astype('string')
users_df['lang'] = users_df['lang'].astype('string')

Let's check whether all the operations have been performed correctly and how much space we have saved.

In [None]:
tweets_esize = getsizeof(tweets_df)/(1024.0**3)
user_esize = getsizeof(users_df)/(1024.0**2)
print("Tweets Dataframe specifics : ------------- \n{} - size: {:.2f} GB\n|||||| SAVED SPACE: {:.1f}% ||||||\n".format(tweets_df.dtypes, tweets_esize, (1-tweets_esize/tweets_ssize)*100))
print("Users Dataframe specifics:------------- \n{} - size: {:.2f} MB\n|||||| SAVED SPACE: {:.1f}% ||||||".format(users_df.dtypes, user_esize, (1-user_esize/user_ssize)*100))

In [None]:
users_df.describe()

In [None]:
tweets_df.describe()

Using pandas' describe method, we can see simple statistics on the dataframes. In the tweets dataset there are extremely large values (including inf) and negative values, so those columns contain outliers.
In the section "Visualizing data distributions" we will provide more statistics.
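
To quantify what describe only hints at, the infinite and negative values can be counted per column. This is a minimal sketch on toy data; the helper name `inf_and_negative_counts` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd


def inf_and_negative_counts(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Count infinite and negative entries for each of the given columns."""
    values = df[columns].astype("float64")
    return pd.DataFrame({
        "inf": np.isinf(values).sum(),      # counts both +inf and -inf
        "negative": (values < 0).sum(),     # NaN compares as False
    })


# Toy data (hypothetical values)
toy = pd.DataFrame({
    "retweet_count": [0.0, 5.0, np.inf, -1.0],
    "reply_count": [1.0, np.nan, 2.0, 3.0],
})

report = inf_and_negative_counts(toy, ["retweet_count", "reply_count"])
print(report)
```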

## Visualizing data distributions


In this section we will show the distribution of the data by displaying different plots for various features in the dataset.

In [None]:
color=['#12a0d7']

### Tweets dataset

Substitute inf values

In [None]:
# we substitute inf values with NaN  in order to compute some plots and later we compute the mean
tweets_df.replace([np.inf, -np.inf], np.nan, inplace=True)

**Data distribution of numerical fields**

In [None]:
def multiple_histograms(df: DataFrame, columns):
  fig, axs = plt.subplots(2, 3, sharex=False, sharey=False, dpi=80)
  idx_col = 0

  for i in range(2):
    for j in range(3):

      col = columns[idx_col]
      idx_col +=1
      
      ax = df[col].plot.hist(bins=6, logy=True,
                                    align='mid',title=col,
                                    grid=True,figsize=(20,10),
                                    ax = axs[i, j], color=color)

      ax.grid(axis='both', alpha=0.5, linestyle='--')
   
      
columns = ["retweet_count", "reply_count", "favorite_count", "num_hashtags",  "num_urls", "num_mentions"] 

multiple_histograms(tweets_df, columns=columns)

As the x-axis scale shows, values range up to about 10^210, but only for a few tweets. This is a clear sign of rows that fall far outside the bulk of the distribution.

In [None]:
columns = ["retweet_count", "reply_count", "favorite_count", "num_hashtags",  "num_urls", "num_mentions"] 

f, ax = plt.subplots(figsize=(20, 7))
sns.boxplot(data=tweets_df[columns], orient="h")
# Tweak the visual presentation
ax.xaxis.grid(True)
ax.grid(axis='both', alpha=0.5, linestyle='--')
ax.set_xlim(-1000, 1000000)
sns.despine(trim=True, left=True)
plt.show()

In [None]:
def multiple_boxplots(df: DataFrame, columns):
  fig, axs = plt.subplots(2, 3, sharex=False, sharey=False, dpi=80)
  fig.set_size_inches(20, 10)
  idx_col = 0

  for i in range(2):
    for j in range(3):

      col = columns[idx_col]
      idx_col +=1

      ax = df[col].plot.box(showmeans=True, 
                              grid=True, ax = axs[i, j])
      ax.set_ylim(-10, 1000000)


      ax.grid(axis='both', alpha=0.5, linestyle='--')
   
      
columns = ["retweet_count", "reply_count", "favorite_count", "num_hashtags",  "num_urls", "num_mentions"] 

multiple_boxplots(tweets_df, columns=columns)

As the boxplots above show, most values are collapsed into dense areas, while a few very high values remain. These will be dealt with, keeping the significant information, so that the analysis can focus on the meaningful ranges.
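
The points drawn beyond the whiskers of a boxplot follow, by default, the 1.5 × IQR rule. The sketch below applies that rule to a toy series to flag such values explicitly. It is shown only to make the boxplots easier to read; it is not necessarily the outlier handling adopted later in the project, and `iqr_outlier_mask` is a hypothetical helper.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy counts: a dense group of small values plus one extreme value
toy_counts = pd.Series([0, 1, 1, 2, 2, 3, 1_000_000])

flags = iqr_outlier_mask(toy_counts)
print(toy_counts[flags].tolist())
```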

**Distribution of created_at**

In [None]:
years = tweets_df['created_at'].dt.year
years.value_counts().sort_index().plot(kind="bar", logy=True)

As the plot shows, many tweet dates make no sense: some precede the creation of Twitter and others lie in the future.

In [None]:
sns.set_theme(style="ticks")
f, ax = plt.subplots(figsize=(7, 8))
sns.despine(f)

# Create the histogram setting the column to be represented and the one to overlap
g = sns.histplot(
    tweets_df,
    x=tweets_df['created_at'].dt.month, hue=tweets_df['created_at'].dt.year,
    multiple='layer',
    log_scale=[False, True],
    discrete=True,
    palette='husl'
)

# Tweak the visual presentation
ax.xaxis.set_major_formatter(mpl.ticker.ScalarFormatter())
ax.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
ax.set_xlabel('Months')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1), title='Year')
ax.set_xticklabels([month for month in calendar.month_name[1:]],
                    fontdict={'horizontalalignment': 'center', 'fontsize': 12, 'rotation': 30})
plt.show()

del g, ax

Here we can see, for every year, how the tweets are distributed over the months.

In [None]:
today = pd.to_datetime("today")     # we set today since there are no tweets later than 2020 with meaningful
twitter_birth = pd.to_datetime("2006-03-21")

mask_datesOK = (tweets_df['created_at'] < today) & (tweets_df['created_at'] > twitter_birth)  
  
print("Number of tweets with a not coherent date: ", len(tweets_df[~mask_datesOK]))

**Distribution of length of tweets**


In [None]:
ax = tweets_df['text'].str.len().plot.hist(bins=30, logy=True, 
                                           align='mid',
                                           figsize=(10,6), grid=True)
ax.set_xlabel("Length")

ax.grid(axis='both', alpha=0.5, linestyle='--')
xticks = np.arange(0, 430, 15)
ax.set_xticks(xticks)
ax.tick_params(axis='x', labelrotation=-90)
plt.show()
del ax

**Distribution of the tweets based on the IDs**

In [None]:
ax1 = tweets_df[mask_datesOK].plot.scatter(x='created_at', y='id', c=color, s=0.1)
plt.show()
del ax1

Here we zoom in on the tweets with plausible dates in order to extrapolate some sort of correlation between IDs and dates. As we can see, the density increases for higher IDs at later dates.
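
A visual impression of this kind can be backed by a rank correlation between the ID and the timestamp. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, on toy data where IDs grow with time; the values are hypothetical and say nothing about the real dataset.

```python
import pandas as pd

# Toy tweets whose IDs increase with the creation date (hypothetical values)
toy = pd.DataFrame({
    "id": [10, 25, 31, 47, 52],
    "created_at": pd.to_datetime([
        "2019-01-01", "2019-03-15", "2019-06-30", "2019-11-02", "2020-02-20",
    ]),
})

# Represent each timestamp as an integer so that it can be ranked like the IDs
timestamps = toy["created_at"].astype("int64")

# Spearman's rho is the Pearson correlation computed on the ranks
rho = toy["id"].rank().corr(timestamps.rank())
print(rho)
```

A value close to 1 would indicate that IDs increase monotonically with time; on the real data, rows with missing IDs or dates would have to be dropped first.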

### Users dataset

**Distribution of created_at**

In [None]:
sns.set_theme(style="ticks")
f, ax = plt.subplots(figsize=(7, 8))
sns.despine(f)


# Create the histogram setting the column to be represented and the one to overlap
g = sns.histplot(
    users_df,
    x=users_df['created_at'].dt.month, hue=users_df['created_at'].dt.year,
    multiple='layer',
    log_scale=[False, True],
    discrete=True,
    palette='husl'
)

# Tweak the visual presentation
ax.xaxis.set_major_formatter(mpl.ticker.ScalarFormatter())
ax.set_xticks([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
ax.set_xlabel('Months')
sns.move_legend(ax, "upper left", bbox_to_anchor=(1, 1), title='Year')
ax.set_xticklabels([month for month in calendar.month_name[1:]],
                    fontdict={'horizontalalignment': 'center', 'fontsize': 12, 'rotation': 30})
plt.show()

del g, ax

**Distribution of statuses count**

In [None]:
ax = users_df['statuses_count'].plot.hist(bins=30, logy=True, 
                                           align='mid',title="Histogram of statuses_count",
                                           figsize=(10,6), grid=True)


ax.grid(axis='both', alpha=0.5, linestyle='--')

del ax

## Visualizing data distributions by differentiating bots and non-bots.


### Languages of actual users and bots

In [None]:
sns.set_theme(style="ticks")

f, ax = plt.subplots(figsize=(15, 5))
sns.despine(f)

sns.histplot(
    users_df,
    x='lang', hue='bot',
    multiple="stack",
    palette=sns.color_palette("pastel",2),
    edgecolor=".7",
    log_scale = [False, True],
    linewidth=.5,
    stat='count',
).set(title='Number of bot and non-bot users per language')
ax.set_ylabel("Frequency")
ax.set_xlabel("Language")
ax.grid(axis='both', alpha=0.5, linestyle='--')
ax.set_xticklabels([lang for lang in users_df['lang'].unique()],
                    fontdict={'horizontalalignment': 'center', 'fontsize': 12, 'rotation': 90})
plt.show()

del f, ax

### Percentage of users: Bot vs Non-Bot

In [None]:
bots = users_df[users_df['bot'] == 1]
non_bots = users_df[users_df['bot'] == 0]
labels = 'Bots', 'Non-Bots'
sizes = [len(bots), len(non_bots)]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.show()

### Percentage of tweets: Bot vs Non-Bot

In [None]:
bots = users_df[users_df['bot'] == 1]
non_bots = users_df[users_df['bot'] == 0]

bots_ids = bots['id'].to_list()
tweets_of_bots = tweets_df[tweets_df['user_id'].isin(bots_ids)]

non_bots_ids = non_bots['id'].to_list()
tweets_of_non_bots = tweets_df[tweets_df['user_id'].isin(non_bots_ids)]

labels = 'Bots', 'Non-Bots'
sizes = [len(tweets_of_bots), len(tweets_of_non_bots)]
fig1, ax1 = plt.subplots()
ax1.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.show()
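
The two `isin` filters above leave out, without notice, any tweet whose `user_id` does not appear in the users dataframe. A possible alternative, sketched here on toy data, maps each tweet to its author's label in one step and counts the unmatched tweets explicitly. The toy frames and variable names are assumptions.

```python
import pandas as pd

# Toy users and tweets; user 99 does not exist in the users table
toy_users = pd.DataFrame({"id": [1, 2, 3], "bot": [True, False, True]})
toy_tweets = pd.DataFrame({"user_id": [1, 1, 2, 3, 3, 3, 99]})

# Look up the bot label of each tweet's author; unknown authors become NaN
labels = toy_tweets["user_id"].map(toy_users.set_index("id")["bot"])

# Tweets whose author is not in the users table
unmatched = labels.isna().sum()

# Share of bot tweets among the tweets with a known author
matched = labels.dropna().astype(bool)
bot_share = matched.mean()

print(unmatched, bot_share)
```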

### How long are the tweets written by the bots & non-bots?

In [None]:
bot_mask = users_df['bot'] == True
tweets_by_bot = tweets_df
tweets_by_bot['bot'] = tweets_df['user_id'].isin(users_df[bot_mask]['id'])

sns.set_theme(style="ticks")

f, ax = plt.subplots(figsize=(7, 5))
sns.despine(f)

sns.histplot(
    tweets_by_bot,
    x=tweets_by_bot['text'].str.len(), hue='bot',
    multiple="stack",
    palette=sns.color_palette("pastel",2),
    edgecolor=".7",
    log_scale = [False, True],
    linewidth=.5,
    stat='count',
    binwidth=15,
    binrange=[0, 430],
)
ax.set_ylabel("Frequency")
ax.set_xlabel("Length")
xticks = np.arange(0, 430, 15)
ax.set_xticks(xticks)
ax.grid(axis='both', alpha=0.5, linestyle='--')
ax.tick_params(axis='x', labelrotation=90)
plt.show()

del f, ax

### When were the bots created (years)?

In [None]:
sns.set_theme(style="ticks")

f, ax = plt.subplots(figsize=(7, 5))
sns.despine(f)

sns.histplot(
    users_df,
    x=users_df['created_at'].dt.year, hue='bot',
    multiple="stack",
    palette=sns.color_palette("pastel",2),
    edgecolor=".7",
    log_scale = [False, False],
    linewidth=.5,
    stat='count',
)
ax.set_ylabel("Counts")
ax.set_xlabel("")
ax.grid(axis='both', alpha=0.5, linestyle='--')
ax.tick_params(axis='x', labelrotation=30)
plt.show()

del f, ax

### Calendar Heatmaps

These plots show how sparse the data are over the years: most days have few tweets compared with the spikes in late 2019 and early 2020.
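
Note that `value_counts()` on a timestamp column counts identical timestamps, not days. When the number of tweets per day needs to be inspected directly, the timestamps can be truncated to midnight first. The sketch below shows this on toy data; the variable names are illustrative.

```python
import pandas as pd

# Toy timestamps: two tweets on the same day, one on the next day, one missing
toy_times = pd.Series(pd.to_datetime([
    "2019-11-01 08:00:00",
    "2019-11-01 21:30:00",
    "2019-11-02 10:15:00",
    None,
]))

# Drop missing dates, truncate to the day, then count the tweets per day
daily_counts = (
    toy_times.dropna()
    .dt.normalize()
    .value_counts()
    .sort_index()
)
print(daily_counts)
```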

Calendar heatmap of tweets written by bots

In [None]:
bots = users_df[users_df['bot'] == True]
bots_id = bots['id'].to_list()
bots_tweets_df = tweets_df[tweets_df['user_id'].isin(bots_id)]
events = bots_tweets_df['created_at'].value_counts()

calmap.calendarplot(events, monthticks=3, daylabels='MTWTFSS',
                    dayticks=[0, 2, 4, 6], cmap='YlGn',
                    fillcolor='grey', 
                    linewidth=1.5,
                    fig_kws=dict(figsize=(30, 20)))
plt.show()

Calendar heatmap of tweets written by non-bots


In [None]:
# select the genuine users and the tweets they wrote
non_bots = users_df[users_df['bot'] == False]
non_bots_id = non_bots['id'].to_list()
non_bots_tweets_df = tweets_df[tweets_df['user_id'].isin(non_bots_id)]
events = non_bots_tweets_df['created_at'].value_counts()

calmap.calendarplot(events, monthticks=3, daylabels='MTWTFSS',
                    dayticks=[0, 2, 4, 6], cmap='YlGn',
                    fillcolor='grey', 
                    linewidth=1.5,
                    fig_kws=dict(figsize=(30, 50)))
plt.show()

Calendar heatmap of creation of bots


In [None]:
bots = users_df[users_df['bot'] == True]
events = bots['created_at'].value_counts()

calmap.calendarplot(events, monthticks=3, daylabels='MTWTFSS',
                    dayticks=[0, 2, 4, 6], cmap='YlGn',
                    fillcolor='grey', 
                    linewidth=1.5, 
                    fig_kws=dict(figsize=(30, 20)))
plt.show()

Calendar heatmap of creation of non-bots


In [None]:
non_bots = users_df[users_df['bot'] == False]
events = non_bots['created_at'].value_counts()

calmap.calendarplot(events, monthticks=3, daylabels='MTWTFSS',
                    dayticks=[0, 2, 4, 6], cmap='YlGn',
                    fillcolor='grey', 
                    linewidth=1.5,
                    fig_kws=dict(figsize=(30, 20)))
plt.show()