# Polarization on Twitter
### By Jonathan Gustafsson Frennert

## I. Dependencies

In [1]:
import os

import matplotlib.pyplot as plt
import matplotlib as mpl
from mpl_toolkits import mplot3d
import seaborn as sns

import math
import numpy as np
np.random.seed(42)
import pandas as pd
import dask.dataframe as dd

import warnings
warnings.filterwarnings('ignore')

print("All packages imported!")

All packages imported!


## II. Matplotlib Parameters

In [2]:
mpl.rcParams['figure.dpi'] = 100
mpl.rcParams['font.size'] = 9

In [3]:
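# LaTeX document text width, in points (pt)
latex_width = 390.0

def set_size(width=latex_width, height=latex_width, fraction=1, subplots=(1, 1)):
    """Set figure dimensions to avoid scaling in LaTeX.

    width, height: target size in points; fraction scales both.
    subplots: (rows, columns); the height is multiplied by rows / columns.
    Returns (width, height) in inches, suitable for ``figsize``.

    Credit to Jack Walton for the function.
    Source: https://jwalton.info/Embed-Publication-Matplotlib-Latex/
    """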

    fig_width_pt = width * fraction
    fig_height_pt = height * fraction
    
    inches_per_pt = 1 / 72.27
    
    fig_width_in = fig_width_pt * inches_per_pt
    fig_height_in = fig_height_pt * inches_per_pt * (subplots[0] / subplots[1])

    return (fig_width_in, fig_height_in)
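
As a rough usage sketch, the helper can be called with a subplot grid or a width fraction and the result passed to `figsize`. The function body is restated compactly here only so the snippet runs on its own; the variable names `full` and `half` are illustrative and not part of the notebook.

```python
# Compact restatement of set_size from the cell above (same arithmetic).
def set_size(width=390.0, height=390.0, fraction=1, subplots=(1, 1)):
    inches_per_pt = 1 / 72.27  # TeX points per inch
    fig_width_in = width * fraction * inches_per_pt
    fig_height_in = height * fraction * inches_per_pt * (subplots[0] / subplots[1])
    return (fig_width_in, fig_height_in)

# Full text width with one row and two columns of axes:
# the height is halved because rows / columns = 1 / 2.
full = set_size(subplots=(1, 2))

# Half the text width with a single axes: width and height stay equal.
half = set_size(fraction=0.5)

print(f"1x2 grid:   {full[0]:.2f} x {full[1]:.2f} in")
print(f"half width: {half[0]:.2f} x {half[1]:.2f} in")

# The tuple is intended for figsize, for example:
# fig, axes = plt.subplots(1, 2, figsize=full)
```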

## III. Color Palette

The palette is from the [iWantHue](http://medialab.github.io/iwanthue/) website by Mathieu Jacomy at the Sciences-Po Medialab.

In [4]:
colors = [
    "#ba4c40",
    "#45c097",
    "#573485",
    "#a8ae3e",
    "#8874d9",
    "#69a050",
    "#be64b2",
    "#bc7d36",
    "#5d8ad4",
    "#b94973"
]
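
One possible way to apply the palette globally is to install it as Matplotlib's default color cycle. This is only a sketch of that option, not necessarily how the later figures use `colors`; the `Agg` backend is selected here purely so the snippet runs without a display.

```python
import matplotlib as mpl
mpl.use("Agg")  # non-interactive backend, assumed only for this sketch
from cycler import cycler

colors = [
    "#ba4c40", "#45c097", "#573485", "#a8ae3e", "#8874d9",
    "#69a050", "#be64b2", "#bc7d36", "#5d8ad4", "#b94973",
]

# Every new Axes will now cycle through the ten palette colors in order.
mpl.rcParams["axes.prop_cycle"] = cycler(color=colors)

print(len(mpl.rcParams["axes.prop_cycle"]), "colors in the cycle")
```

Individual entries can still be passed explicitly, for example `color=colors[0]`, when a plot needs a fixed color per party.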

## IV. Twitter Dataset

**Provenance:** Ibrahim Sabuncu, "USA Nov.2020 Election 20 Mil. Tweets (with Sentiment and Party Name Labels) Dataset." *IEEE Dataport*, 14 Aug. 2020, doi: https://dx.doi.org/10.21227/25te-j338.

**License:** [Developer Agreement](https://developer.twitter.com/en/developer-terms/agreement)

**Usage Information:** 
- "you may only use the following information for non-commercial, internal purposes (e.g., to improve the functionality of the Services): (a) aggregate Twitter Applications user metrics, such as number of active users or accounts on Twitter Applications; (b) the responsiveness of Twitter Applications; and (c) results, usage statistics, data or other information (in the aggregate or otherwise) derived from analyzing, using, or regarding the performance of the Twitter API."

- "you may not use, or knowingly display, distribute, or otherwise make Twitter Content, or information derived from Twitter Content, available to any entity for the purpose of: (a) conducting or providing surveillance or gathering intelligence, including but not limited to investigating or tracking Twitter users or Twitter Content; (b) conducting or providing analysis or research for any unlawful or discriminatory purpose, or in a manner that would be inconsistent with Twitter users' reasonable expectations of privacy; (c) monitoring sensitive events (including but not limited to protests, rallies, or community organizing meetings); or (d) targeting, segmenting, or profiling individuals based on sensitive personal information, including their health (e.g., pregnancy), negative financial status or condition, political affiliation or beliefs, racial or ethnic origin, religious or philosophical affiliation or beliefs, sex life or sexual orientation, trade union membership, Twitter Content relating to any alleged or actual commission of a crime, or any other sensitive categories of personal information prohibited by law."

<center> <h3>Dataset Contents*</h3> </center>

<center> <h4><code>uselection_tweets_1jul_11nov.csv</code></h4> </center>

| Variable | Format | Description | Example |
| :- | :- | :- | :- | 
| `Created-At`$\,$ | Timestamp$\,$ | Exact creation time of the tweet $\,$ | 7/1/20 7:44 PM |
| `From-User-Id`$\,$ | String$\,$ | Unique ID of the user who sent the tweet $\,$ | 1223446325758394369 |
| `To-User-Id`$\,$ | String$\,$ | Unique ID of the user the tweet was sent to, -1 if nobody $\,$ | 387882597 |
| `Language`$\,$ | String$\,$ | Language of the tweet, coded in ISO 639-1 $\,$ | en |
| `PartyName`$\,$ | String$\,$ | Label showing which party the tweet is about $\,$ | BothParty |
| `Id`$\,$ | String$\,$ | Unique ID of the tweet $\,$ | 1278368973948694528 |
| `Score`$\,$ | Float$\,$ | Sentiment score of the tweet $\,$ | 0.102564 |

\**Only the fields used in this notebook are shown.*

## V. Prepare Twitter Dataset

### Importing

In [5]:
twitter_columns = ['Created-At', 'From-User-Id', 'To-User-Id', 'Language', 'PartyName', 'Id', 'Score']
twitter_filepath = os.path.join(os.getcwd(), 'data', 'twitter', 'uselection_tweets_1jul_11nov.csv')
twitter_data = dd.read_csv(twitter_filepath, sep=';', usecols=twitter_columns)

#twitter_data_gen =  pd.read_csv(twitter_filepath, sep= ';', usecols=twitter_columns, chunksize=10000)
#twitter_data = [chunk for chunk in twitter_data_gen]
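
An alternative worth considering is to declare the ID columns as strings at read time, so they are never inferred as integers in the first place. The sketch below uses plain pandas on a two-row in-memory sample; the first row reuses the example values from the table above and the second row is made up. `dd.read_csv` forwards keyword arguments such as `dtype` to `pandas.read_csv`, so the same mapping should carry over, but that is an assumption to verify against the real file.

```python
import io
import pandas as pd

# Tiny ';'-separated sample in the dataset's layout (second row is invented).
sample = io.StringIO(
    "Created-At;From-User-Id;To-User-Id;Language;PartyName;Id;Score\n"
    "7/1/20 7:44 PM;1223446325758394369;387882597;en;BothParty;1278368973948694528;0.102564\n"
    "7/1/20 7:45 PM;1223446325758394370;-1;en;Democrats;1278368973948694529;-0.25\n"
)

# Hypothetical helper mapping: treat every identifier column as text.
id_dtypes = {"From-User-Id": "str", "To-User-Id": "str", "Id": "str"}

df = pd.read_csv(sample, sep=";", dtype=id_dtypes)
print(df.dtypes)
```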

### Cleaning

#### Correcting Inferred Variable Types

In [6]:
twitter_data.dtypes

Created-At       object
From-User-Id      int64
To-User-Id        int64
Language         object
PartyName        object
Id                int64
Score           float64
dtype: object

- `Created-At` should be a timestamp
- `From-User-Id` should be a string
- `To-User-Id` should be a string
- `Id` should be a string

In [7]:
twitter_data['Created-At'] = dd.to_datetime(twitter_data['Created-At'])
twitter_data['From-User-Id'] = twitter_data['From-User-Id'].astype('str')
twitter_data['To-User-Id'] = twitter_data['To-User-Id'].astype('str')
twitter_data['Id'] = twitter_data['Id'].astype('str')

#for df in twitter_data:
#    df['Created-At'] = pd.to_datetime(df['Created-At'])
#    df['From-User-Id'] = df['From-User-Id'].astype('str')
#    df['To-User-Id'] = df['To-User-Id'].astype('str')
#    df['Id'] = df['Id'].astype('str')
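
If the timestamps all follow the pattern of the example in the table (`7/1/20 7:44 PM`), passing an explicit format avoids per-value format inference and makes the month/day order unambiguous. The format string below is inferred from that single example, so it is an assumption about the rest of the column; the second sample value is invented.

```python
import pandas as pd

# Two sample strings in the assumed month/day/two-digit-year, 12-hour layout.
raw = pd.Series(["7/1/20 7:44 PM", "11/11/20 12:05 AM"])

# %m/%d/%y = month/day/year, %I:%M %p = 12-hour clock with AM/PM marker.
parsed = pd.to_datetime(raw, format="%m/%d/%y %I:%M %p")
print(parsed)
```

`dd.to_datetime` accepts the same `format` keyword, so the call in the cell above could be extended in the same way once the pattern is confirmed on the full data.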

#### Handling NaN Values and Outliers

#### Remove Duplicates

#### Initial Filters

In [8]:
twitter_data = twitter_data[((twitter_data['PartyName'] == 'Republicans') | (twitter_data['PartyName'] == 'Democrats')) 
                            & (twitter_data['Language'] == 'en')]
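
To illustrate what this filter keeps, here is a small pandas sketch on a made-up four-row frame. It also shows that the two equality tests can be written with `isin`, which yields the same mask; the names `toy`, `mask_or`, and `mask_isin` are illustrative only.

```python
import pandas as pd

# Invented rows covering the cases the filter distinguishes.
toy = pd.DataFrame({
    "PartyName": ["Republicans", "Democrats", "BothParty", "Democrats"],
    "Language": ["en", "en", "en", "es"],
})

# Same logic as the notebook cell: single-party label AND English.
mask_or = (
    ((toy["PartyName"] == "Republicans") | (toy["PartyName"] == "Democrats"))
    & (toy["Language"] == "en")
)

# Equivalent formulation using isin.
mask_isin = toy["PartyName"].isin(["Republicans", "Democrats"]) & (toy["Language"] == "en")

kept = toy[mask_isin]
print(kept)
```

Rows labeled `BothParty` and non-English rows are dropped, which is why only the first two toy rows survive.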

In [None]:
twitter_data.isna().sum().compute()
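
One caveat when reading these counts: according to the table above, `To-User-Id` uses `-1` when a tweet is addressed to nobody, and after the cast to string that sentinel is the text `"-1"`, which `isna()` does not count as missing. The pandas sketch below, on invented values, shows one way to convert the sentinel before counting. Whether "no recipient" should be treated as missing is an analysis choice, not something the dataset prescribes.

```python
import pandas as pd

# Invented sample: two tweets with no recipient, one missing score.
toy = pd.DataFrame({
    "To-User-Id": ["387882597", "-1", "-1"],
    "Score": [0.102564, None, 0.5],
})

# The "-1" strings are ordinary values, so only Score reports a missing entry.
print(toy.isna().sum())

# Replace the sentinel with a missing value, then count again.
to_user = toy["To-User-Id"].mask(toy["To-User-Id"] == "-1")
print(to_user.isna().sum())
```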