In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all' # to print multiple outputs from the same cell
import math
import utils
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from operator import index
from collections import defaultdict
from scipy.stats import pearsonr
from datetime import datetime

In [None]:
users_df = pd.read_csv("dataset/users.csv")

# User Data Understanding and Preparation

In users.csv there are the following variables:
1. User Id: a unique identifier of the user
2. Name: the name of the user
3. Statuses Count: according to the teacher, this is the number of tweets made by the user at the moment of data crawling. According to the [Twitter API docs](https://developer.twitter.com/en/docs/twitter-api/v1/data-dictionary/object-model/user), this is the number of Tweets (including retweets) issued by the user; according to Francesca Naretto, it does not include replies. Since tweets.csv also includes users' replies, note that **there is no link between the number of tweets for each user in tweets.csv and statuses_count**.
4. Lang: the language selected by the user
5. Created at: the timestamp at which the profile was created
6. Label: a binary variable that indicates whether a user is a bot or a genuine user

In [None]:
users_df.info(verbose=True, show_counts=True, memory_usage= "deep")

### Attribute type and quality

In the **user** dataset there are 6 columns:

1. The **id** column seems to be fine: all values are integers and there are no null values. We still have to check for possible duplicates.

2. There is 1 null value in the **name** column. We assume that a name can be a string, a number or a special character. Names are not necessarily unique, but it may be interesting to check their frequency distribution.

3. The **lang** column has no null values, but we have to check whether there are problems in the pattern used to express the language. We expect a categorical attribute.

4. The **bot** column is numerical (binary) as expected. We have to check whether all the values are 0 or 1.

5. The **created_at** attribute has no null values, but we have to check the correctness of the dates, both syntactic and semantic (not too far in the past or in the future).

6. The **statuses_count** column has 399 null values, and the non-null values appear to be unexpectedly stored as floats.

# 1. ID column

In [None]:
print("Number of total IDs:", len(users_df["id"]))
print("Number of unique IDs:", len(pd.unique(users_df["id"])))
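
If the two numbers printed above ever differ, it helps to see which rows are involved. The following is a minimal sketch on a made-up toy frame (`toy_users` and its values are assumptions for illustration, not data from users.csv) that uses `Series.duplicated` and `Series.is_unique`:

```python
import pandas as pd

# Toy frame standing in for users_df (values are made up for illustration)
toy_users = pd.DataFrame({"id": [101, 102, 102, 103], "name": ["a", "b", "b2", "c"]})

# keep=False marks every row that shares its id with another row
duplicated_rows = toy_users[toy_users["id"].duplicated(keep=False)]
n_duplicated_ids = duplicated_rows["id"].nunique()

print(duplicated_rows)
print("IDs that occur more than once:", n_duplicated_ids)
print("Is the id column unique?", toy_users["id"].is_unique)
```

The same two calls can be applied directly to `users_df["id"]`.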

Renaming id to user_id to avoid confusion.

In [None]:
users_df.rename(columns={"id": "user_id"}, inplace=True)

# 2. Name Column

As noted above, one name is null. There are also duplicate names, but this is not surprising, as many people share the same name. By plotting the name frequencies we can see that there are no strange phenomena.

In [None]:
print("Number of total names:", len(users_df["name"]))
print("Number of unique names:", len(pd.unique(users_df["name"])))

# count how many times each name occurs (a missing name is kept as its own key)
freq = users_df["name"].value_counts(dropna=False).to_dict()

number_of_total_names = len(users_df["name"])
not_empty_or_missing_names = []
empty_or_missing_names = []
names_with_only_spaces = []

# classify every name as missing/empty, whitespace-only, or valid
for value in users_df["name"]:
    if pd.isna(value) or value == "":
        empty_or_missing_names.append(value)
    elif str(value).strip() == "":
        names_with_only_spaces.append(value)
    else:
        not_empty_or_missing_names.append(value)

print(f"Number of total names = {number_of_total_names} vs total name values that are not NA or empty = {len(not_empty_or_missing_names)}")
print(f"Number of total names = {number_of_total_names} vs total name values that are NA or empty = {len(empty_or_missing_names)}")
print(f"Number of names made only of whitespace = {len(names_with_only_spaces)}")

# histogram (log scale) of how often each name occurs
pd.DataFrame({"frequencies": list(freq.values())}).hist(
    column=["frequencies"],
    log=True,
    bins=utils.get_sturges_bins(len(freq))
)

The single missing name does not appear to be significant, so we leave it as it is for now. Now let's check the different languages in the "lang" column.

# 3. Language Column

In [None]:
pd.unique(users_df["lang"])  

The "lang" field is composed of [IETF language codes](https://en.wikipedia.org/wiki/IETF_language_tag). By selecting only the unique values it's possible to see that there are some erroneous values:
* "Select Language..." and "xx-lc" seem to be **default values**
* other values are not properly formatted (e.g. "zh-cn" instead of "zh-CN")

We propose to check the most common language used in the tweets of the users with these erroneous values and to assign them a more fitting language attribute. This will be done after we have analysed the tweets data.

In [None]:
utils.repair_lang_attribute(users_df)
pd.unique(users_df["lang"])
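
The body of `utils.repair_lang_attribute` is not shown in this notebook, so the following is only a sketch of what such a normalisation could look like. The helper `normalize_lang_tag` and the `DEFAULT_VALUES` set are hypothetical names introduced for illustration; the sketch assumes that placeholder values should become missing and that two-letter region subtags should be upper-case.

```python
import pandas as pd

# Assumed placeholder values, taken from the unique values discussed in this section
DEFAULT_VALUES = {"Select Language...", "xx-lc"}

def normalize_lang_tag(tag):
    """Return a tag such as 'zh-CN' from 'zh-cn'; placeholders become None."""
    if pd.isna(tag) or tag in DEFAULT_VALUES:
        return None
    parts = str(tag).strip().split("-")
    language = parts[0].lower()
    if len(parts) == 1:
        return language
    # by convention two-letter region subtags are written in upper case
    region = parts[1].upper() if len(parts[1]) == 2 else parts[1]
    return f"{language}-{region}"

toy_langs = pd.Series(["en", "zh-cn", "en-GB", "Select Language...", "xx-lc"])
normalized = toy_langs.map(normalize_lang_tag)
print(normalized.tolist())
```

Whether the placeholders are then dropped or replaced with the most common tweet language is a separate decision; the sketch only marks them as missing.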

Since the wrong values make up just 0.02% of the rows, they are meant to be dropped *(!!! the actual code in utils.py doesn't drop the default values)*, while the other values are mapped to the correct ones.

# 4. Bot Column

In [None]:
pd.unique(users_df["bot"])

As the output of users_df.info() shows, the bot attribute is exactly as expected: all values are non-null.

With the unique function we verify that the bot values consist only of zeroes and ones.

# 5. Created_at Column

We observe that the created_at column is recognized by pandas as an object, and not as a datetime as we would expect from this attribute. We clean the created_at field by converting its strings to datetime. We also check that there are no users whose account was created before the first tweet was posted or after the release of the dataset.

In [None]:
# parse the created_at strings into datetime objects
users_df["created_at"] = pd.to_datetime(users_df["created_at"])

# find users created before the first tweet
before_time_users_df = users_df[users_df["created_at"] < datetime(2006,3,21,9,50,0)]
before_time_users_df.info()

# find users created after the dataset release
after_time_users_df = users_df[users_df["created_at"] > datetime(2022,9,29,11,0,0)]
after_time_users_df.info()

No users were created before or after our thresholds. This is good.
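
The conversion above raises an error if a string cannot be parsed. As a sketch of how the syntactic and the semantic check could be separated, the example below uses made-up timestamps (`toy_dates` is an assumption for illustration) together with the same two thresholds as the cell above:

```python
import pandas as pd

# Made-up timestamps; the last one is syntactically invalid on purpose
toy_dates = pd.Series(["2019-02-22 18:00:42", "2015-07-01 09:30:00", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(toy_dates, errors="coerce")
syntactically_invalid = toy_dates[parsed.isna()]

# semantic check against the lower and upper thresholds
lower_bound = pd.Timestamp("2006-03-21 09:50:00")
upper_bound = pd.Timestamp("2022-09-29 11:00:00")
out_of_range = parsed[(parsed < lower_bound) | (parsed > upper_bound)]

print(syntactically_invalid.tolist())
print(len(out_of_range))
```

Comparisons with NaT evaluate to False, so the unparseable entry is reported only by the syntactic check.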

# 6. Statuses_count Column

We expect the statuses count to be an integer, but pandas has interpreted it as a float. This is probably due to the presence of NaN values. Checking for NaN values.

In [None]:
users_df[users_df["statuses_count"].isna()]
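
To illustrate why missing values force a float dtype, here is a small sketch on a made-up series (`toy_counts` is an assumption for illustration). It also shows pandas' nullable `Int64` dtype, which is an alternative to dropping the rows and is not what this notebook does:

```python
import numpy as np
import pandas as pd

toy_counts = pd.Series([76, np.nan, 54])
print(toy_counts.dtype)  # float64, because NaN is itself a float

# Option 1: drop the missing rows, then cast to a plain integer dtype
as_int = toy_counts.dropna().astype("int64")

# Option 2: keep the missing rows by using pandas' nullable integer dtype
as_nullable_int = toy_counts.astype("Int64")

print(as_int.dtype, as_nullable_int.dtype)
```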

In [None]:
nan_status_count_users_df = users_df[users_df["statuses_count"].isna()]
list_of_humans = []
list_of_bots = []
# split the users with a missing statuses_count by label (0 = human, 1 = bot)
for elem in nan_status_count_users_df["bot"]:
    if elem == 1:
        list_of_bots.append(elem)
    elif elem == 0:
        list_of_humans.append(elem)
    else:
        print(f"Unexpected bot label: {elem}")

print(f"The users with NaN values for statuses_count consist of {len(list_of_humans)} humans and {len(list_of_bots)} bots")


We have found 399 accounts with NaN values in the statuses_count column, and all of these accounts belong to humans.

In [None]:
# split the users by label (0 = human, 1 = bot)
bot_users_df = users_df[users_df["bot"] == 1]
human_users_df = users_df[users_df["bot"] == 0]
total = len(bot_users_df) + len(human_users_df)
print(f"Out of the {total} users in our dataset, {len(human_users_df)} are humans and {len(bot_users_df)} are bots.")
print(f"The dataset consists of {round(100 * len(human_users_df) / total, 3)}% humans and {round(100 * len(bot_users_df) / total, 3)}% bots respectively.")
print(f"{len(list_of_humans)} of our users are missing their statuses_count values. These humans make up {round(100 * len(list_of_humans) / total, 3)}% of our dataset.")
print("These users will be removed.")

These NaN values affect 3.467% of all the users in the dataset, and all of the affected users are human. We will nevertheless remove them, since the dataset is fairly balanced: it contains 46.854% humans and 53.146% bots. In a perfect world this ratio would be 1:1, but we can balance it later in our training sets should we need to.
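
The class shares can also be obtained without splitting the frame by hand. The sketch below uses a made-up label series (`toy_bot` and its proportions are assumptions for illustration) and `Series.value_counts` with `normalize=True`:

```python
import pandas as pd

# Toy labels (0 = human, 1 = bot); the proportions are made up
toy_bot = pd.Series([0, 0, 1, 1, 1], name="bot")

counts = toy_bot.value_counts()
shares = toy_bot.value_counts(normalize=True).mul(100).round(3)

# one row per label, with absolute and relative frequency side by side
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```

Applied to `users_df["bot"]`, this would give the same percentages as the prints above.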

In [None]:
users_df.drop(users_df[users_df["statuses_count"].isna()].index, inplace=True)
users_df.info(verbose=True, show_counts=True, memory_usage= "deep")

The dtype of statuses_count is still float, even though the NaN values have been dropped. We convert the remaining values to int64.

In [None]:
users_df["statuses_count"] = users_df["statuses_count"].astype(np.int64)
users_df.info(verbose=True, show_counts=True, memory_usage= "deep")

We should also make sure that our statuses_count column only contains numbers greater than or equal to 0.

In [None]:
users_df[users_df["statuses_count"] < 0]

We verify that all our statuses_count values are positive or equal to zero. This is good.

### Distribution of variables and statistics
Let's study them!

## 3. Language Column

In [None]:
langs = pd.unique(users_df["lang"])
bot_freqs = []
user_freqs = []
# count genuine users (bot == 0) and bots (bot == 1) for every language
for lang in langs:
    lang_mask = users_df["lang"] == lang
    user_freqs.append(int((lang_mask & (users_df["bot"] == 0)).sum()))
    bot_freqs.append(int((lang_mask & (users_df["bot"] == 1)).sum()))
langs_df = pd.DataFrame({"lang": langs, "bot_freqs": bot_freqs, "user_freqs": user_freqs})
langs_df.plot.bar(x="lang", logy=True)
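
The same per-language counts can be sketched more compactly with `pd.crosstab`. The toy frame below is made up for illustration; the column names `user_freqs` and `bot_freqs` mirror the ones used above:

```python
import pandas as pd

# Toy data standing in for users_df (values are made up)
toy_users = pd.DataFrame({
    "lang": ["en", "en", "it", "it", "it", "es"],
    "bot":  [0, 1, 0, 0, 1, 1],
})

# rows: languages, columns: labels; missing combinations are filled with 0
lang_by_label = pd.crosstab(toy_users["lang"], toy_users["bot"])
lang_by_label = lang_by_label.rename(columns={0: "user_freqs", 1: "bot_freqs"})
print(lang_by_label)
```

Note that `crosstab` sorts the languages alphabetically, whereas the loop above keeps them in order of first appearance.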

## 6. Statuses_count Column

In [None]:
users_df.hist(
    column=["statuses_count"], 
    log=True, 
    bins=utils.get_sturges_bins(len(users_df["statuses_count"]))
)

# one histogram per class, each with a bin count based on that class's own sample size
for label, group in users_df.groupby("bot"):
    group.hist(
        column=["statuses_count"],
        log=True,
        bins=utils.get_sturges_bins(len(group))
    )
    plt.title(f"statuses_count (bot = {label})")
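
The implementation of `utils.get_sturges_bins` is not shown in this notebook. Assuming it follows the usual form of Sturges' rule, ceil(log2(n)) + 1 bins for n samples, a minimal sketch could look like this (`sturges_bins` is a hypothetical stand-in, and the real helper may differ):

```python
import math

def sturges_bins(n_samples):
    """Number of histogram bins suggested by Sturges' rule: ceil(log2(n)) + 1."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return math.ceil(math.log2(n_samples)) + 1

for n in (1, 100, 1000):
    print(n, sturges_bins(n))
```

Because the number of bins grows only logarithmically with the sample size, using the whole dataset instead of a single class usually changes the bin count by at most one or two.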

We perform outlier detection (via a boxplot) on the only numeric column we have in the users dataframe: statuses_count.

In [None]:
users_df.plot(
    kind="box",
    column="statuses_count",
    logy=True
)

By looking at the boxplot above, we notice the presence of outliers; hence, we replace the values above the upper whisker with the median.

In [None]:
whisker_lower_bound, whisker_upper_bound = utils.compute_whiskers(users_df["statuses_count"])
statuses_count_median = users_df["statuses_count"].median() 

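# replace the values above the upper whisker with the median (assigning back instead of a chained inplace call)
users_df["statuses_count"] = users_df["statuses_count"].mask(users_df["statuses_count"] > whisker_upper_bound, statuses_count_median)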
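
Since `utils.compute_whiskers` is not shown here, the following is only a sketch under the assumption that it uses the common Tukey convention, with fences placed 1.5 times the interquartile range beyond the quartiles. The helper `tukey_fences` and the toy values are hypothetical:

```python
import pandas as pd

def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up counts with one extreme value
toy_counts = pd.Series([10, 12, 14, 15, 18, 20, 5000])
lower, upper = tukey_fences(toy_counts)
median = toy_counts.median()

# replace only the values above the upper fence with the median
cleaned = toy_counts.mask(toy_counts > upper, median)
print(lower, upper, median)
print(cleaned.tolist())
```

Replacing with the median keeps the number of rows unchanged, at the price of adding mass at the centre of the distribution.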

Let's see how our outlier removal has affected the distribution of the statuses_count variable.

In [None]:
users_df.hist(
    column=["statuses_count"], 
    log=True, 
    bins=utils.get_sturges_bins(len(users_df["statuses_count"]))
)

# one histogram per class, each with a bin count based on that class's own sample size
for label, group in users_df.groupby("bot"):
    group.hist(
        column=["statuses_count"],
        log=True,
        bins=utils.get_sturges_bins(len(group))
    )
    plt.title(f"statuses_count (bot = {label})")

## User Data Quality Summary

In [None]:
users_df.info(verbose=True, show_counts=True)
users_df.describe()

In [None]:
# recompute the class balance after cleaning
bot_users_df = users_df[users_df["bot"] == 1]
human_users_df = users_df[users_df["bot"] == 0]
total = len(bot_users_df) + len(human_users_df)
print(f"Out of the {total} users in our dataset, {len(human_users_df)} are humans and {len(bot_users_df)} are bots.")
print(f"The dataset consists of {round(100 * len(human_users_df) / total, 3)}% humans and {round(100 * len(bot_users_df) / total, 3)}% bots respectively.")

After cleaning we are left with a "fairly" balanced and generalized dataset, that is ready for further use. The dataset contains approx. 45% human and 55% bot users. 

In [None]:
users_df.to_csv("./dataset/users_dataset_cleaned.csv",index=False)
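
One thing to keep in mind when loading the cleaned file later: CSV does not store dtypes, so created_at comes back as plain text unless it is parsed again. The sketch below shows this on a made-up frame written to an in-memory buffer (the buffer and the toy values are assumptions for illustration):

```python
import io
import pandas as pd

toy_users = pd.DataFrame({
    "user_id": [1, 2],
    "created_at": pd.to_datetime(["2019-02-22 18:00:42", "2015-07-01 09:30:00"]),
    "statuses_count": [76, 54],
})

# write to an in-memory buffer instead of a file, to keep the example self-contained
buffer = io.StringIO()
toy_users.to_csv(buffer, index=False)

# read the same content twice: without and with date parsing
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["created_at"])

print(plain["created_at"].dtype, parsed["created_at"].dtype)
```

With the real file, the same `parse_dates=["created_at"]` argument can be passed when reading users_dataset_cleaned.csv.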