# Social Impact Theory with Twitter Data
## Tasks
In this assignment you will do the following tasks:

1. Construct the timelines of Twitter users
2. Visualize distributions and scatter plots
3. Fit and visualize a regression model
4. Bootstrapping

### Install requirements

The following cell installs all the dependencies needed for this assignment. Run it once before continuing.  
* [`pandas`](https://pandas.pydata.org/docs/index.html) is a Python package for creating and working with tabular data. [Here](https://pandas.pydata.org/docs/reference/index.html) is the documentation of `pandas`.
* [`numpy`](https://numpy.org/) is a Python package for mathematical functions. [Here](https://numpy.org/doc/stable/reference/index.html) is the documentation of `numpy`.
* [`matplotlib`](https://matplotlib.org/) is a Python package for creating plots. [Here](https://matplotlib.org/stable/api/index.html) is the documentation of `matplotlib`.
* [`scikit-learn`](https://scikit-learn.org/stable/) is a Python package with different machine learning algorithms. [Here](https://scikit-learn.org/stable/modules/classes.html) is the documentation of `sklearn`.

In [2]:
! pip install pandas
! pip install numpy
! pip install matplotlib
! pip install scikit-learn






### Import requirements
The cell below imports all necessary dependencies. Make sure they are installed (see cell above).

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# 1 Construct the timelines of Twitter users

## 1.1 Getting a list of users
In this task, choose a list with a few hundred Twitter users. You can find some example IDs of such lists [here](https://docs.google.com/spreadsheets/d/1tcNy1q_eQH3HXGt-0hkmSNEGbcOUiC5si3kZ6-F0pB8/).  
Load the information for every user in the list from the `users.csv` file and store it in a pandas DataFrame.


In [7]:
# Your Code goes here!
users_list = pd.read_csv("users.csv")
users_list


Unnamed: 0,username,id,created_at,name,verified,profile_image_url,description,protected,location,url,public_metrics.followers_count,public_metrics.following_count,public_metrics.tweet_count,public_metrics.listed_count,entities.url.urls,entities.description.hashtags,entities.description.urls,pinned_tweet_id,entities.description.mentions
0,RepConnieConway,1544718378987933703,2022-07-06T16:24:50.000Z,Connie Conway,True,https://pbs.twimg.com/profile_images/155671548...,"Mom, grandmom, wife, daughter of the Central V...",False,"Tulare, CA / Washington, D.C.",https://t.co/Iw5uKMYU4t,220,5,60,7,"[{'start': 0, 'end': 23, 'url': 'https://t.co/...","[{'start': 131, 'end': 136, 'tag': 'CA22'}]",,,
1,repmayraflores,1538990997769707523,2022-06-20T21:07:08.000Z,Congresswoman Mayra Flores,True,https://pbs.twimg.com/profile_images/153941811...,Proudly representing Texas's 34th District in ...,False,,,27800,55,277,48,,,,,
2,CongresswomanSC,1484252226646421505,2022-01-20T19:53:08.000Z,Congresswoman Sheila Cherfilus-McCormick,False,https://pbs.twimg.com/profile_images/149180185...,Congresswoman Sheila Cherfilus-McCormick 📍Prou...,False,"Miramar, FL",https://t.co/wE6d2R5SXn,1005,25,284,36,"[{'start': 0, 'end': 23, 'url': 'https://t.co/...","[{'start': 130, 'end': 150, 'tag': 'workingfor...",,,
3,RepMikeCarey,1457745193197780993,2021-11-08T16:21:58.000Z,Congressman Mike Carey,True,https://pbs.twimg.com/profile_images/146027559...,Congressman Mike Carey. Proudly serving Ohio's...,False,,https://t.co/CnOlkJTpaX,1075,283,907,43,"[{'start': 0, 'end': 23, 'url': 'https://t.co/...",,"[{'start': 76, 'end': 99, 'url': 'https://t.co...",,
4,RepShontelBrown,1456381091598700556,2021-11-04T22:01:33.000Z,Rep. Shontel Brown,True,https://pbs.twimg.com/profile_images/147075892...,Representative for Ohio’s Eleventh Congression...,False,,https://t.co/v695zCnmxN,32924,348,911,163,"[{'start': 0, 'end': 23, 'url': 'https://t.co/...",,,1.463532e+18,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
549,SteveDaines,11651202,2007-12-30T05:52:03.000Z,Steve Daines,True,https://pbs.twimg.com/profile_images/146907311...,Serving the people of Montana in the U.S. Sena...,False,"Bozeman, MT",,81287,661,8865,1729,,,,,
550,ChuckGrassley,10615232,2007-11-26T15:17:02.000Z,Chuck Grassley,True,https://pbs.twimg.com/profile_images/921098191...,U.S. Senator. Family farmer. Lifetime resident...,False,Iowa,https://t.co/gGaOVfn75R,742952,12701,11666,6273,"[{'start': 0, 'end': 23, 'url': 'https://t.co/...",,,,"[{'start': 80, 'end': 94, 'username': 'Grassle..."
551,MarkWarner,7429102,2007-07-12T14:03:33.000Z,Mark Warner,True,https://pbs.twimg.com/profile_images/139693385...,"U.S. Senator from Virginia. \nChairman, Senate...",False,Virginia,https://t.co/sIM9jRW0rP,509595,22966,12925,4931,"[{'start': 0, 'end': 23, 'url': 'https://t.co/...",,,,
552,JimInhofe,7270292,2007-07-05T14:39:13.000Z,Sen. Jim Inhofe,True,https://pbs.twimg.com/profile_images/124393715...,United States Senator from the great state of ...,False,Oklahoma,https://t.co/qvRXqSkqYw,96383,343,5408,2428,"[{'start': 0, 'end': 23, 'url': 'https://t.co/...",,,,


From those users, we are interested in those who have written at least 100 tweets and have at least 100 followers. From the remaining set, sample 500 users at random. Check out pandas boolean indexing [here](https://pandas.pydata.org/pandas-docs/dev/user_guide/indexing.html#boolean-indexing). To randomly select 500 users you can use the pandas [`sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method.
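
As a rough illustration of the filtering and sampling step, here is a minimal sketch on synthetic data. The dataframe `users_demo` and the cap on the sample size are assumptions made for this example; the column names follow the `users.csv` output shown above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the users table (the real one is loaded from users.csv).
rng = np.random.default_rng(0)
users_demo = pd.DataFrame({
    "id": np.arange(1, 801),
    "public_metrics.tweet_count": rng.integers(0, 5000, size=800),
    "public_metrics.followers_count": rng.integers(0, 100000, size=800),
})

# Boolean indexing: combine both conditions with &, each wrapped in parentheses.
active = users_demo[
    (users_demo["public_metrics.tweet_count"] >= 100)
    & (users_demo["public_metrics.followers_count"] >= 100)
]

# sample() raises a ValueError if n exceeds the number of rows, so cap n here.
n_sample = min(500, len(active))
users_sample = active.sample(n=n_sample, random_state=42)
print(len(active), len(users_sample))
```

Passing `random_state` makes the random sample reproducible between runs.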

In [None]:
# Your Code goes here!



## 1.2 Loading timelines

Load the `timeline.csv` file.

In [None]:
# Your Code goes here!



## 1.3 Aggregating and arranging data
With the timeline loaded, we want to calculate some metrics from the tweets, in particular the mean retweet count, which is often referred to as the social impact. For this you can use the pandas [`groupby`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) method. Group the data by `author_id` and calculate the mean retweet count of each user.
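
The aggregation could look roughly like the sketch below. It uses a tiny synthetic timeline; the retweet column name `public_metrics.retweet_count` and the output name `retweet_mean` are assumptions, so check the actual column names in `timeline.csv`.

```python
import pandas as pd

# Tiny synthetic timeline; "public_metrics.retweet_count" is an assumed column name.
timeline_demo = pd.DataFrame({
    "author_id": [1, 1, 2, 2, 2, 3],
    "public_metrics.retweet_count": [10, 20, 0, 3, 6, 100],
})

# Mean retweet count per author; as_index=False keeps author_id as a regular column.
retweet_mean = (
    timeline_demo
    .groupby("author_id", as_index=False)["public_metrics.retweet_count"]
    .mean()
    .rename(columns={"public_metrics.retweet_count": "retweet_mean"})
)
print(retweet_mean)
```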

In [None]:
# Your Code goes here!



In [None]:
# Your Code goes here!



Next we want to combine the users data with the newly created mean retweet information. To do this, merge the users dataframe with the dataframe holding the mean retweet count of each user. Use the `pandas` [`merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html) method.    

Afterwards remove all unused columns. At the end the dataframe should contain only the author ID, the name, the follower count and the mean retweet count. 

Attention: The user ID in the timeline dataframe (and therefore in the mean retweet dataframe) is in the column `author_id`, while the user ID in the users dataframe (loaded from `users.csv`) is in the column `id`. You can use the keyword arguments `left_on` and `right_on` to merge the two dataframes on these differently named columns.
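
A minimal sketch of such a merge on differently named key columns is shown below. Both small dataframes and the column name `retweet_mean` are made up for illustration, and the inner join is one possible choice, not a requirement of the task.

```python
import pandas as pd

# Synthetic stand-ins for the users table and the per-user mean retweet table.
users_demo = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["A", "B", "C"],
    "public_metrics.followers_count": [150, 2000, 90000],
    "location": ["x", "y", "z"],
})
retweet_mean = pd.DataFrame({"author_id": [1, 2, 4], "retweet_mean": [15.0, 3.0, 7.5]})

# Merge on author_id (left) and id (right); an inner join keeps only matching users.
merged = retweet_mean.merge(users_demo, left_on="author_id", right_on="id", how="inner")

# Keep only the columns needed for the later analysis.
data = merged[["author_id", "name", "public_metrics.followers_count", "retweet_mean"]]
print(data)
```

Users that appear in only one of the two tables (here IDs 3 and 4) are dropped by the inner join.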

In [None]:
# Your Code goes here!



In [None]:
# Your Code goes here!



# 3 Visualize distributions and scatter plots

## 3.1 Distribution of the number of followers
Plot a histogram of the number of followers of the users in your dataset. Repeat this with a logarithmic `y` scale. Which one is more skewed?  

You can use the pandas [`hist`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.hist.html) method with the keyword argument `log` for a logarithmic scale, or you can use matplotlib's [`hist`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html) function (don't forget to create a figure first), again with the keyword argument `log`. Choose a suitable value for the `bins` argument to make the plot easier to read. 
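
One way to draw both histograms side by side with matplotlib is sketched below. The follower counts are synthetic (drawn from a log-normal distribution as an assumption), and `bins=30` is an arbitrary choice.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, heavy-tailed follower counts standing in for the real column.
rng = np.random.default_rng(1)
followers = rng.lognormal(mean=9, sigma=1.5, size=500)

fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

# Left: linear y axis. Right: same histogram with a logarithmic y axis.
ax_lin.hist(followers, bins=30)
ax_lin.set_title("Linear y scale")
ax_log.hist(followers, bins=30, log=True)
ax_log.set_title("Logarithmic y scale")

for ax in (ax_lin, ax_log):
    ax.set_xlabel("Number of followers")
    ax.set_ylabel("Number of users")
fig.tight_layout()
```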

In [None]:
# Your Code goes here!



## 3.2 Distribution of social impact

Repeat the above task for the social impact of your users, again with both a linear and a logarithmic `y` scale. Which one is more skewed?

In [None]:
# Your Code goes here!



## 3.3 Number of followers vs social impact
Create a scatter plot with the number of followers of each user on the x axis and the social impact of each user on the y axis. Both axes should use a logarithmic scale. Is there a relationship?  

Again you can use the pandas [`scatter`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.scatter.html) method with `logx` and `logy` set to `True`, or you can use matplotlib's [`scatter`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) function. With matplotlib, use the `set_xscale` and `set_yscale` methods of the axes object to set both scales to `'log'`.
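
The matplotlib variant might look like the following sketch. The data are synthetic, and the power-law relationship built into `impact` is an assumption made only so that the plot shows some structure.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic followers and social impact with a noisy power-law relationship.
rng = np.random.default_rng(2)
followers = rng.lognormal(mean=9, sigma=1.5, size=500)
impact = 0.05 * followers**0.7 * rng.lognormal(mean=0, sigma=0.5, size=500)

fig, ax = plt.subplots()
ax.scatter(followers, impact, s=10, alpha=0.5)

# Switch both axes to a logarithmic scale.
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Number of followers")
ax.set_ylabel("Social impact (mean retweets)")
```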

In [None]:
# Your Code goes here!



# 4 Fit and visualize a regression model

## 4.1 Fit a linear model

First, add two new columns to the dataframe: `SI`, the logarithm of the social impact (mean retweet count), and `FC`, the logarithm of the number of followers. For this you can use NumPy's log function `np.log(...)`.  

Now fit a linear regression model with sklearn. For this use the class [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) to create a linear regression instance and then call the `fit` method. `SI` is used as the dependent variable (target) and `FC` as the independent variable (feature).  

Print the model intercept and coefficient. For this you can use the model's attributes `coef_` and `intercept_`.
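
A minimal sketch of the transformation and the fit is given below. The data are synthetic, the column names `followers` and `retweet_mean` are hypothetical, and dropping rows with zero values before taking logarithms is a suggested precaution rather than part of the task description.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known log-log slope of 0.7 (assumption for illustration).
rng = np.random.default_rng(2)
followers = rng.lognormal(mean=9, sigma=1.5, size=500)
retweets = 0.05 * followers**0.7 * rng.lognormal(mean=0, sigma=0.5, size=500)
data = pd.DataFrame({"followers": followers, "retweet_mean": retweets})

# np.log(0) is -inf, so drop rows with zero values before taking logarithms.
data = data[(data["retweet_mean"] > 0) & (data["followers"] > 0)].copy()
data["SI"] = np.log(data["retweet_mean"])
data["FC"] = np.log(data["followers"])

# fit() expects a 2-D feature array, hence the double brackets around "FC".
model = LinearRegression()
model.fit(data[["FC"]], data["SI"])
print("intercept:", model.intercept_)
print("coefficient:", model.coef_[0])
```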

In [None]:
# Your Code goes here!



In [None]:
# Your Code goes here!



## 4.2 Plot the results
Now recreate the scatter plot from 3.3 and add a line plot showing the regression line of the model. For this use the intercept and the coefficient (slope). Does the line fit the data as you expected?  

It is easier to use matplotlib here to add the line plot to the scatter plot. For the line plot you can use matplotlib's [`plot`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html) function. For the x values you can use NumPy's [`np.linspace`](https://numpy.org/doc/stable/reference/generated/numpy.linspace.html#numpy.linspace) function to evenly space x values in a certain range. The y values can be calculated with the intercept and the slope as follows:  
$$
y = \text{slope} \cdot x + \text{intercept}
$$
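
The line formula above can be sketched as follows. To keep the example simple it plots the log-transformed variables `FC` and `SI` directly on linear axes, which is an assumption about how you set up the plot; the data are synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic log-transformed data: FC = log(followers), SI = log(mean retweets).
rng = np.random.default_rng(3)
FC = rng.normal(9, 1.5, size=300)
SI = -3.0 + 0.7 * FC + rng.normal(0, 0.5, size=300)
model = LinearRegression().fit(FC.reshape(-1, 1), SI)

# Evenly spaced x values over the observed range, y from slope and intercept.
x_line = np.linspace(FC.min(), FC.max(), 100)
y_line = model.coef_[0] * x_line + model.intercept_

fig, ax = plt.subplots()
ax.scatter(FC, SI, s=10, alpha=0.5, label="users")
ax.plot(x_line, y_line, color="red", label="regression line")
ax.set_xlabel("FC = log(followers)")
ax.set_ylabel("SI = log(mean retweets)")
ax.legend()
```

If you instead keep the raw values on logarithmically scaled axes as in 3.3, the same line can be drawn by plotting `np.exp(x_line)` against `np.exp(y_line)`, since the model was fitted on natural logarithms.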

In [1]:
# Your Code goes here!



## 4.3 Calculate quality of the fit
Calculate the residuals of the model and save them in a vector. This can be done with the following formula:
$$
\text{residual} = y_{true} - y_{pred}
$$
where $y_{true}$ are the true values of the dependent variable (in our case `SI`) and $y_{pred}$ are the predicted values with the model. To get the predicted values of the model you can use the [`predict`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) method of the model.  

Afterwards calculate the variance of the residuals and the variance of the social impact variable `SI`. For this you can use NumPy's [`var`](https://numpy.org/doc/stable/reference/generated/numpy.var.html) function. Is the variance of the residuals lower than the variance of the dependent variable? By what proportion?
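
The residual and variance computation could be sketched like this, again on synthetic data. The variable name `explained` is hypothetical; for an ordinary least squares fit with an intercept, this proportion coincides with the R² value returned by the model's `score` method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic log-transformed data standing in for FC and SI.
rng = np.random.default_rng(4)
FC = rng.normal(9, 1.5, size=300)
SI = -3.0 + 0.7 * FC + rng.normal(0, 0.5, size=300)
X = FC.reshape(-1, 1)
model = LinearRegression().fit(X, SI)

# Residuals: true values minus model predictions.
residuals = SI - model.predict(X)

var_res = np.var(residuals)
var_si = np.var(SI)

# Proportion of the variance of SI that the model accounts for.
explained = 1 - var_res / var_si
print("variance of residuals:", var_res)
print("variance of SI:", var_si)
print("proportion explained:", explained)
```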

In [None]:
# Your Code goes here!



## 4.4 Distribution of residuals
Plot a histogram of the residuals. Do they look normally distributed?  

Again you can use matplotlib as before to plot the histogram.

In [None]:
# Your Code goes here!



# 5 Bootstrapping

## 5.1 One sample
For bootstrapping we first look at creating a single sample. Take the dataframe with the follower count and social impact from before and sample random rows with replacement, drawing as many rows as the dataframe has. This can again be done with the pandas [`sample`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sample.html) method and the keyword argument `replace` set to `True`.  

Fit a new linear regression model with this new dataset. What is the value of the coefficient and the intercept now?
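
A single bootstrap sample and refit might look like the sketch below. The dataframe is synthetic, and drawing `n=len(data)` rows reflects the usual bootstrap convention rather than anything stated explicitly in the task.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic dataframe with the log-transformed columns FC and SI.
rng = np.random.default_rng(5)
FC = rng.normal(9, 1.5, size=300)
data = pd.DataFrame({"FC": FC, "SI": -3.0 + 0.7 * FC + rng.normal(0, 0.5, size=300)})

# One bootstrap sample: as many rows as the original, drawn with replacement.
boot = data.sample(n=len(data), replace=True, random_state=0)

boot_model = LinearRegression().fit(boot[["FC"]], boot["SI"])
print("intercept:", boot_model.intercept_)
print("coefficient:", boot_model.coef_[0])
```

Because rows are drawn with replacement, some users appear several times in `boot` and others not at all, which is why the estimates differ slightly from the original fit.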

In [None]:
# Your Code goes here!



## 5.2 Many bootstrap samples
Now repeat this 10000 times and save the resulting coefficients in a vector.
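
The repetition can be sketched as a simple loop. This example resamples row indices with NumPy instead of calling `sample`, which is an alternative chosen here for speed, and it uses a reduced, hypothetical `n_boot = 1000` so that the sketch runs quickly; the task itself asks for 10000.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic log-transformed data standing in for FC and SI.
rng = np.random.default_rng(6)
FC = rng.normal(9, 1.5, size=300)
SI = -3.0 + 0.7 * FC + rng.normal(0, 0.5, size=300)
X = FC.reshape(-1, 1)

n_boot = 1000  # reduced for this sketch; the task asks for 10000
coefs = np.empty(n_boot)

for i in range(n_boot):
    # Draw row indices with replacement, then refit on the resampled rows.
    idx = rng.integers(0, len(SI), size=len(SI))
    coefs[i] = LinearRegression().fit(X[idx], SI[idx]).coef_[0]

print("mean bootstrap coefficient:", coefs.mean())
print("standard deviation:", coefs.std())
```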

In [None]:
# Your Code goes here!



## 5.3 Bootstrap histogram

Plot a histogram of the coefficients from the bootstrap samples and add a vertical line at the value of the coefficient of the original model (from task 4.1). To add a vertical line to the histogram in matplotlib you can use the [`axvline`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.axvline.html) function.  

How far is the line from the center of the histogram?
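
A possible way to draw this plot and quantify the distance is sketched below. The array `coefs` and the value `original_coef` are synthetic stand-ins for your own results, and the 95% percentile interval is an optional extra that is not required by the task.

```python
import matplotlib.pyplot as plt
import numpy as np

# Stand-ins for the bootstrap coefficients (5.2) and the original coefficient (4.1).
rng = np.random.default_rng(7)
coefs = rng.normal(loc=0.7, scale=0.02, size=10_000)
original_coef = 0.7  # hypothetical value, replace with model.coef_[0]

fig, ax = plt.subplots()
ax.hist(coefs, bins=50)
ax.axvline(original_coef, color="red", linestyle="--", label="original coefficient")
ax.set_xlabel("Bootstrap coefficient")
ax.set_ylabel("Count")
ax.legend()

# Distance between the original coefficient and the center of the histogram.
offset = original_coef - coefs.mean()

# Percentile interval covering the central 95% of the bootstrap coefficients.
lower, upper = np.percentile(coefs, [2.5, 97.5])
print("offset from bootstrap mean:", offset)
print("95% percentile interval:", lower, upper)
```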

In [None]:
# Your Code goes here!



# To learn more
* Do you find any relationship between social impact and the number of followers?
* How sure are you that the coefficient is larger than zero? How sure are you that it is lower than 1?
* Is the estimated coefficient within the range predicted by Social Impact Theory?
* Under that relationship, if I have 1000 followers, how many more followers do I need to double my social impact?
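
As a hint for the last question, here is a short derivation, assuming the fitted model has the form $SI = a + b \cdot FC$ with $SI = \ln R$ (mean retweets $R$) and $FC = \ln F$ (followers $F$). The symbols $a$, $b$, $R$ and $F$ are introduced here for illustration only.

$$
\begin{aligned}
\ln R &= a + b \ln F \quad\Rightarrow\quad R = e^{a} F^{b} \\
\frac{R_2}{R_1} &= \left(\frac{F_2}{F_1}\right)^{b} = 2 \quad\Rightarrow\quad F_2 = F_1 \cdot 2^{1/b} \\
F_2 - F_1 &= F_1 \left(2^{1/b} - 1\right)
\end{aligned}
$$

With $F_1 = 1000$ and a purely hypothetical coefficient of $b = 0.5$, this gives $F_2 = 4000$, that is, 3000 additional followers. Insert your own estimate of $b$ to answer the question for your data.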