# Challenges for week 3

Now that we've seen how to clean data in Pandas, it's time for you to apply this knowledge. This week has three challenges. Make sure to give them a try and complete all of them.

**Some important notes for the challenges:**
1. These challenges are a warm-up and help you get ready for class. Make sure to give them a try. If you get an error message, try to troubleshoot it (using Google often helps). If all else fails, go to the next challenge (but make sure to hand it in).
2. While we of course like when you get all the answers right, the important thing is to exercise and apply the knowledge. So we will still accept challenges that may not be complete, as long as we see enough effort *for each challenge*. This means that if one of the challenges is not delivered (not started and no attempt shown), we unfortunately will not be able to provide a full grade for that week.
3. Delivering the challenge to the right place is a critical part of the challenge. This means we will only be able to grade and accept challenges that are live on your own private GitHub repository (so with a link starting with https://github.com/uva-cw-digitalanalytics/2021s1-) **and** delivered on time as a Canvas assignment. Watch the videos on Canvas on how to hand in your challenges.

### Facing issues? 

We are constantly monitoring the issues on the GitHub general repository (https://github.com/uva-cw-digitalanalytics/2021s1/issues) to help you out. Don't hesitate to log an issue there, explaining well what the problem is, showing the code you are using, and the error message you may be receiving. 

**Important:** We only monitor the repository on weekdays, and until 17.00. Issues logged after this time will most likely be answered the next day. This means you should not wait for our response before submitting a challenge :-)

## Getting set up for the challenges

We will use actual Twitter data for the challenges of this week. To do so, you need:
* To download DMI-TCAT data that you may already be collecting for yourself, or from a colleague (if you haven't requested data collection yet). Please use **the same data** that you requested sentiment analysis for
* The sentiment analysis results (get them from SurfDrive)

If you don't have sentiment analysis results, get them from a colleague (in SurfDrive), but then make sure to also download their Twitter data from DMI-TCAT, otherwise the merge won't work.

**All the challenges below use this Twitter data. Make sure to start your challenge by doing the basics of loading and inspecting the data, even if this is not specified in the challenge itself.**



***

In [133]:
import pandas as pd
twitter = pd.read_csv("tcat_NuriaVila-20210207-20210208------------fullExport--9654fe3ff4.csv")
sentiment = pd.read_pickle("NuriaVila_EN_completed.pkl")

***

## Challenge 1

Create two binary variables for the Twitter data based on the text column. They should be two **meaningful** categories for your data, and they should have either the value 0 (when the tweet is not of that category) or 1 (when the tweet is of that category). 

Make sure to explain (in MarkDown) what these variables are, and provide some descriptives when they are done.

***

In [134]:
twitter.dtypes

id                               int64
time                             int64
created_at                      object
from_user_name                  object
text                            object
filter_level                    object
possibly_sensitive             float64
withheld_copyright             float64
withheld_scope                 float64
truncated                      float64
retweet_count                    int64
favorite_count                   int64
lang                            object
to_user_name                    object
in_reply_to_status_id          float64
quoted_status_id               float64
source                          object
location                        object
lat                            float64
lng                            float64
from_user_id                     int64
from_user_realname              object
from_user_verified               int64
from_user_description           object
from_user_url                   object
from_user_profile_image_u

In [135]:
# Convert the "created_at" column to datetime (vectorized, same result as applying it row by row)
twitter["created_at"] = pd.to_datetime(twitter["created_at"])

The **first binary variable** that I want to create tells me whether a tweet is an original tweet or a retweet. The difference between these two kinds of posts is that retweets start with "RT @", followed by the username.

In [136]:
#I search for the twitter texts that start with "RT @". 
#Because I want to return "0" and "1", instead of "True" and "False", I set the column as integer.
twitter["retweet"] = twitter["text"].str.startswith("RT @").astype(int)
twitter["retweet"].value_counts()

0    2768
1    2353
Name: retweet, dtype: int64

In [137]:
twitter["retweet"].isna().sum()

0

From the result I see that in this Twitter dataset, there are 2353 retweets and 2768 original tweets. Every text is either an original post or a retweet, and there are no missing values in this variable. This is reasonable, because last time I found that there were no missing values in the ```text``` column of this dataset.
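The flag above works because ```text``` has no missing values. As a minimal sketch of what would happen otherwise, the toy frame below (the name `toy` and its three tweets are made up for illustration, not taken from the dataset) uses the `na=False` argument of `str.startswith`, so a missing text counts as "not a retweet" and the conversion to integer cannot fail:

```python
import pandas as pd

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({"text": ["RT @someone: great book", "Just finished a novel", None]})

# na=False maps a missing text to False, so astype(int) never meets a NaN.
toy["retweet"] = toy["text"].str.startswith("RT @", na=False).astype(int)

print(toy["retweet"].tolist())  # [1, 0, 0]
```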

I also want to explore how many tweets contain external links. Therefore, the **second binary variable** I want to create is a variable that tells me whether the tweet contains "http" or not.

In [138]:
#I search for the twitter texts that contain "http". 
twitter["ext_link"] = twitter["text"].str.contains("http").astype(int)
twitter["ext_link"].value_counts()

1    3587
0    1534
Name: ext_link, dtype: int64

I find that in this dataset, about 70% of the tweets contain one or more external links! Perhaps this is because the dataset was collected with the keyword "penguin random house", a publisher that has its own official website, so users can include the website address in their posts. Also, I guess that keywords like "e-book" or "audiobook" may bring in some tweets that share links to e-books.
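The share can be checked directly from the counts printed above. The sketch below also shows, on a made-up frame (`toy` and its tweets are hypothetical), that the mean of a 0/1 column is the proportion of ones; the stricter pattern `https?://` is my own assumption about what counts as a link and is not what the cell above uses:

```python
import pandas as pd

# Share of tweets with a link, using the value_counts() output shown above.
share_from_counts = 3587 / (3587 + 1534)
print(round(share_from_counts, 3))

# Hypothetical toy data to show the same idea on a binary column.
toy = pd.DataFrame({"text": [
    "New release https://example.com/book",
    "Loved this audiobook",
    "RT @someone: read it here http://example.org",
    "No link in this one",
]})

# Assumption: a link is anything starting with http:// or https://.
toy["ext_link"] = toy["text"].str.contains(r"https?://", regex=True, na=False).astype(int)

# The mean of a 0/1 column is the proportion of rows flagged 1.
toy_share = toy["ext_link"].mean()
print(toy_share)
```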

***

## Challenge 2

Merge the sentiment analysis results with your data. Make sure to check whether the length of the dataframe generated by the merge makes sense.


***

Before merging the dataframes, I need to select the tweets that are written in English, because in last week's challenge I handed in a dataframe that only contained English tweets for sentiment analysis.

In [139]:
twitter = twitter[twitter["lang"] == "en"]

In [140]:
#Check the length of twitter dataframe
len(twitter)

4002

In [141]:
#Check the length of sentiment dataframe
len(sentiment)

4002

I find that the lengths of these two dataframes are the same, so I can now find the unique identifier and start to merge the dataframes.

In [142]:
#Check the columns of twitter dataframe
twitter.dtypes

id                                      int64
time                                    int64
created_at                     datetime64[ns]
from_user_name                         object
text                                   object
filter_level                           object
possibly_sensitive                    float64
withheld_copyright                    float64
withheld_scope                        float64
truncated                             float64
retweet_count                           int64
favorite_count                          int64
lang                                   object
to_user_name                           object
in_reply_to_status_id                 float64
quoted_status_id                      float64
source                                 object
location                               object
lat                                   float64
lng                                   float64
from_user_id                            int64
from_user_realname                

In [143]:
#Check the columns of sentiment dataframe
sentiment.dtypes

id          object
text        object
negative    object
positive    object
neutral     object
dtype: object

By comparing the columns of the two dataframes, I think ```id``` and ```text``` can serve as the unique identifiers. However, in ```sentiment```, ```id``` has the object dtype, so I need to convert it to numeric before merging the two tables.
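A small sketch of that conversion, assuming the ids are stored as strings (the two ids below are made up; the real column could hold other Python objects):

```python
import pandas as pd

# Hypothetical ids stored as strings in an object column.
ids = pd.Series(["1358562103", "1358562104"], dtype=object)

# pd.to_numeric converts the whole column at once and infers an integer dtype.
numeric_ids = pd.to_numeric(ids)

print(numeric_ids.dtype)
```

With both key columns stored as integers, rows with the same id can be matched during the merge.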

In [144]:
# Convert the "id" column to numeric so it matches the dtype of twitter["id"]
sentiment["id"] = pd.to_numeric(sentiment["id"])

In [145]:
#Merge the dataframes
twitter = twitter.merge(sentiment, on=["id", "text"])

In [274]:
#Check the length of the merged table
len(twitter)

4002
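The merged length equals the length of both inputs, which suggests every row found exactly one match. One possible way to make that check explicit is sketched below on two tiny made-up frames (`left` and `right` are hypothetical); `indicator=True` and `validate="one_to_one"` are standard `merge` arguments, but using an outer join here is my own choice for illustration, not what the cell above does:

```python
import pandas as pd

# Hypothetical frames: ids 1 and 2 match, id 3 and id 4 do not.
left = pd.DataFrame({"id": [1, 2, 3], "text": ["a", "b", "c"]})
right = pd.DataFrame({"id": [1, 2, 4], "text": ["a", "b", "d"], "positive": [1, 2, 3]})

# validate raises an error if a key combination is duplicated on either side;
# indicator adds a "_merge" column saying where each row came from.
merged = left.merge(
    right, on=["id", "text"], how="outer", indicator=True, validate="one_to_one"
)

counts = merged["_merge"].value_counts()
print(counts)
```

If every row is labelled `both`, no tweets were lost or left unmatched by the merge.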

***

## Challenge 3

The sentiment analysis results have three interesting columns: ```neutral```, ```positive```, and ```negative```. They come from the trinary version of the SentiStrength (http://sentistrength.wlv.ac.uk/) algorithm.

For this challenge, you need to:
1. Create one variable that summarizes the sentiment (i.e., that somehow aggregates the information of it being positive or negative - or potentially neutral - into one single variable)
2. Using the ```.groupby``` function, compare the means and standard deviations of that variable per category (that you created in Challenge 1).

*Tip: Pandas makes it easy to run numerical operations across columns. Let's say that I want to multiply the value that is in column A by the value that is in column B and store it in column C... I can simply use:*
```df['C'] = df['A'] * df['B']```


**Note:** if you cannot complete #1, make sure to at least complete #2 with each column separately. But do give it a try ;-)

***

### Challenge 3.1

Because the SentiStrength website does not clearly describe how the ```neutral``` column works, I explored the dataframe to work out the meaning of its scores. After reading multiple rows, I found that the value depends on the sum of the ```positive``` and ```negative``` columns:  
* if ```positive``` is 1 and ```negative``` is -1, the value in the ```neutral``` column is 0;
* if the sum is greater than 0, the value in the ```neutral``` column is 1;
* if the sum is less than or equal to 0 (except when ```positive``` and ```negative``` are 1 and -1), the value in the ```neutral``` column is -1.  
Knowing how ```neutral``` works makes it easier to build the new variable.
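The rule inferred above can be written as a small function. This is only a sketch of what I observed in the data, not SentiStrength's documented algorithm, and the function name `inferred_trinary` is hypothetical:

```python
def inferred_trinary(positive: int, negative: int) -> int:
    """Reproduce the observed `neutral` value from `positive` (1..5) and `negative` (-1..-5)."""
    # Weakest possible scores on both sides: the tweet is treated as neutral.
    if positive == 1 and negative == -1:
        return 0
    # Otherwise the sign of the sum decides; a tie counts as negative.
    return 1 if positive + negative > 0 else -1


print(inferred_trinary(1, -1))  # 0
print(inferred_trinary(3, -1))  # 1
print(inferred_trinary(2, -2))  # -1
```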

I tried different ways of generating the new variable (the pairs below are written as (negative, positive)):
* If I use ```positive * negative```, I cannot tell the difference between, for example, (-4,1) and (-1,4);
* If I use ```neutral * (positive ** 2 - negative ** 2)``` or ```positive ** 3 + negative ** 3```, I cannot tell the difference between (-4,4) and (-2,2);
* If I use ```positive ** 3 * negative ** 3```, I cannot tell the difference between (-4,5) and (-5,4);
* If I use ```positive - negative ** 2```, I cannot tell the difference between (-1,1) and (-2,4);
* If I use ```neutral * (positive ** 2 + negative ** 2)```, the result is counter-intuitive. For example, (-1) * ((-4) ** 2 + 1 ** 2) = -17, while (-1) * ((-4) ** 2 + 2 ** 2) = -20. Intuitively, the post with the value -20 should be more negative, but it is actually more positive than the post with the value -17. This calculation captures the overall strength of the sentiment, but not how strong the positive or the negative part is. Moreover, it is meaningless to compute a mean of this value.
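The collisions in the list above can be verified with plain Python. The dictionary names below are hypothetical and the formulas are only the rejected candidates, written as functions of (negative, positive):

```python
# Each candidate formula takes (negative, positive), matching the pairs in the list above.
candidates = {
    "product": lambda neg, pos: pos * neg,
    "sum of cubes": lambda neg, pos: pos ** 3 + neg ** 3,
    "product of cubes": lambda neg, pos: pos ** 3 * neg ** 3,
    "positive minus squared negative": lambda neg, pos: pos - neg ** 2,
}

# Pairs that each formula fails to separate.
colliding_pairs = {
    "product": ((-4, 1), (-1, 4)),
    "sum of cubes": ((-4, 4), (-2, 2)),
    "product of cubes": ((-4, 5), (-5, 4)),
    "positive minus squared negative": ((-1, 1), (-2, 4)),
}

for name, (first, second) in colliding_pairs.items():
    formula = candidates[name]
    print(name, formula(*first), formula(*second))


def signed_sum_of_squares(neutral, neg, pos):
    # Squaring the variable is safe; a literal would need parentheses,
    # because in Python -4 ** 2 evaluates to -16, not 16.
    return neutral * (pos ** 2 + neg ** 2)


print(signed_sum_of_squares(-1, -4, 1))  # -17
print(signed_sum_of_squares(-1, -4, 2))  # -20
```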

Therefore, I came up with three ways to generate the variable, although I think none of them is perfect.
* **Method 1**: I can use ```positive + negative```. Although this cannot tell the difference between (-3,2) and (-2,1), that is sometimes acceptable depending on how the data is used. If I use this Twitter dataset to estimate people's general evaluation of ebooks, and whether they would recommend audiobooks to others, I do not need to distinguish (-3,2) from (-2,1), because I only want to know how likely the ebooks are to be recommended. Users whose posts have a sentiment of (-3,2) or (-2,1) seem less likely to recommend the ebooks, because their overall evaluation leans negative.
* **Method 2**: I can manually assign a value to each combination. As the values of ```positive``` and ```negative``` are integers, I can combine them as ```neutral * (positive.(-negative))```, that is, a decimal number with ```positive``` before the decimal point and the absolute value of ```negative``` after it, multiplied by ```neutral```. In this way, the central point is 0, the sign of the new variable (+ or -) shows whether the sentiment is positive or negative, and the digits show how the two sentiments vary. However, calculating a mean and standard deviation on this value is meaningless.  
The difficulty with assigning numbers is that I am not sure how to use one number to compare changes in two sentiments, but the assignment in the second method does give me a way to compare the different combinations.
* **Method 3**: I can reassign numbers to the results of Method 2. The values from Method 2 range from -5.5 to 5.4, so I can map them onto the integers from -14 to 10. The mapping is shown in the following table:

| Method 2 value | Reassigned value (Method 3) |
| --- | --- |
| 5.4 | 10 |
| 5.3 | 9 |
| 5.2 | 8 |
| 5.1 | 7 |
| 4.3 | 6 |
| 4.2 | 5 |
| 4.1 | 4 |
| 3.2 | 3 |
| 3.1 | 2 |
| 2.1 | 1 |
| 0 | 0 |
| -1.2 | -1 |
| -1.3 | -2 |
| -1.4 | -3 |
| -1.5 | -4 |
| -2.2 | -5 |
| -2.3 | -6 |
| -2.4 | -7 |
| -2.5 | -8 |
| -3.3 | -9 |
| -3.4 | -10 |
| -3.5 | -11 |
| -4.4 | -12 |
| -4.5 | -13 |
| -5.5 | -14 |
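The table follows a regular pattern: positive-leaning codes are ranked upwards from 1 and negative-leaning codes downwards from -1, each in order of magnitude. As a sketch, the same mapping can be generated instead of typed by hand. The function name `build_rank_table` is hypothetical, and the code assumes both scores range from 1 to 5 in absolute value, as in the table:

```python
from itertools import product


def build_rank_table():
    """Rebuild the Method 2 -> Method 3 mapping shown in the table above."""
    positive_codes, negative_codes = [], []
    # pos is the positive score, neg the absolute value of the negative score.
    for pos, neg in product(range(1, 6), repeat=2):
        if pos == 1 and neg == 1:
            continue  # the neutral combination is coded as 0
        code = round(pos + neg / 10, 1)  # e.g. pos=4, neg=2 -> 4.2
        if pos > neg:
            positive_codes.append(code)
        else:
            negative_codes.append(code)  # ties lean negative

    table = {0.0: 0}
    for rank, code in enumerate(sorted(positive_codes), start=1):
        table[code] = rank
    for rank, code in enumerate(sorted(negative_codes), start=1):
        table[-code] = -rank
    return table


rank_table = build_rank_table()
print(rank_table[5.4], rank_table[-2.2], rank_table[-5.5])  # 10 -5 -14
```

A dictionary like `rank_table` could be passed to `Series.replace` or `Series.map`; it also covers -5.5, which appears in the table but not in this dataset.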

In [208]:
#Turn the three sentiment-related variables to numeric
twitter[["positive", "negative", "neutral"]] = twitter[["positive", "negative", "neutral"]].apply(pd.to_numeric)

In [263]:
#Method 1
twitter["med1"] = twitter["positive"] + twitter["negative"]

In [264]:
#Method 2
#Turn the negative value into a positive number
twitter["neg"] = twitter["negative"] * (-1)
#Combine columns "positive" and "neg" as text, e.g. 4 and 2 become "4.2"
twitter["med2"] = twitter["positive"].astype(str) + "." + twitter["neg"].astype(str)
#Change the type of "med2" from text to numeric
twitter["med2"] = pd.to_numeric(twitter["med2"])
#Multiply by neutral to set the sign (or 0 for neutral tweets)
twitter["med2"] = twitter["med2"] * twitter["neutral"]
#Delete the helper column "neg"
del twitter["neg"]

In [265]:
#Method 3
#Check the value in "med2"
twitter["med2"].value_counts()

 0.0    1332
 2.1     877
 3.1     404
 4.1     313
-1.2     291
-2.2     157
 3.2     136
-3.3      75
-2.3      60
-1.3      59
 4.2      58
-2.4      58
-3.4      48
-1.4      39
 5.1      31
 5.2      20
 4.3      16
-3.5       7
-2.5       6
-4.4       4
-4.5       4
-1.5       3
 5.3       2
 5.4       2
Name: med2, dtype: int64

In [266]:
#Reassign the numbers according to the table above
twitter["med3"] = twitter["med2"].replace({0.0:0, 2.1:1, 3.1:2, 4.1:4, -1.2:-1, -2.2:-5, 3.2:3, -3.3:-9, -2.3:-6, -1.3:-2, 4.2:5, -2.4:-7, -3.4:-10, -1.4:-3, 5.1:7, 5.2:8, 4.3:6, -3.5:-11, -2.5:-8, -4.4:-12, -4.5:-13, -1.5:-4, 5.3:9, 5.4:10})

### Challenge 3.2

**Method 1**: ```Positive + Negative```

In [269]:
twitter.groupby(["retweet"])["med1"].mean()

retweet
0    0.471549
1    0.685145
Name: med1, dtype: float64

In [270]:
twitter.groupby(["retweet"])["med1"].std()

retweet
0    1.224700
1    1.306298
Name: med1, dtype: float64

In [271]:
twitter.groupby(["ext_link"])["med1"].mean()

ext_link
0    0.510841
1    0.594946
Name: med1, dtype: float64

In [272]:
twitter.groupby(["ext_link"])["med1"].std()

ext_link
0    1.270702
1    1.265711
Name: med1, dtype: float64

From the calculation I find that:
* in this Twitter dataset, retweets are on average more positive than original posts. This may indicate that people are more likely to repost tweets that contain positive emotions. However, the mean sentiment of both groups is low and close to neutral;
* retweets vary a bit more in sentiment, since their standard deviation is slightly higher, so some retweets contain stronger positive or negative emotions;
* for posts with and without an external link, the mean sentiments are similar. To me, this result is reasonable, because an external link adds information to a tweet, not emotion;
* the standard deviations for the two groups are also similar, indicating that tweets in both categories contain positive and negative emotions.
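The mean and standard deviation per group can also be requested in a single call with `.agg`. A minimal sketch on made-up numbers (`toy` and its values are hypothetical, chosen only so the result is easy to check by hand):

```python
import pandas as pd

# Hypothetical toy data: two original tweets and two retweets.
toy = pd.DataFrame({
    "retweet": [0, 0, 1, 1],
    "med1": [0, 2, 1, 3],
})

# One row per group, one column per statistic (std uses the sample formula, ddof=1).
summary = toy.groupby("retweet")["med1"].agg(["mean", "std"])
print(summary)
```

This puts both statistics side by side, which makes the groups easier to compare than four separate cells.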

**Method 2**: ```Neutral * (Positive.(-Negative))```

In [243]:
twitter.groupby(["retweet"])["med2"].mean()

retweet
0    0.851679
1    1.070560
Name: med2, dtype: float64

In [244]:
twitter.groupby(["retweet"])["med2"].std()

retweet
0    1.906975
1    2.246714
Name: med2, dtype: float64

In [250]:
twitter.groupby(["ext_link"])["med2"].mean()

ext_link
0    0.942845
1    0.957529
Name: med2, dtype: float64

In [252]:
twitter.groupby(["ext_link"])["med2"].std()

ext_link
0    2.007664
1    2.100936
Name: med2, dtype: float64

**Method 3**: Reassignment

In [246]:
twitter.groupby(["retweet"])["med3"].mean()

retweet
0    0.190765
1    0.144241
Name: med3, dtype: float64

In [247]:
twitter.groupby(["retweet"])["med3"].std()

retweet
0    2.616129
1    3.468401
Name: med3, dtype: float64

In [253]:
twitter.groupby(["ext_link"])["med3"].mean()

ext_link
0    0.188205
1    0.161460
Name: med3, dtype: float64

In [254]:
twitter.groupby(["ext_link"])["med3"].std()

ext_link
0    2.780787
1    3.141095
Name: med3, dtype: float64

As stated above, the means and standard deviations of the overall sentiment calculated from manually assigned numbers are meaningless, because manually assigning numbers turns the variable into a categorical variable.
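One way to see the problem is to look at the distances between neighbouring codes in Method 2. The sketch below uses three adjacent combinations (chosen by me for illustration) and shows that the steps between them are not equal, so an average weighs some steps far more than others:

```python
# Three neighbouring combinations, written as (positive, absolute value of negative).
combos = [(2, 1), (3, 1), (3, 2)]

# Method 2 style codes: positive before the decimal point, negative after it.
codes = [pos + neg / 10 for pos, neg in combos]

# Distance between each code and the next one.
gaps = [round(b - a, 1) for a, b in zip(codes, codes[1:])]
print(gaps)  # [1.0, 0.1]
```

Going from (2,1) to (3,1) moves the code ten times as far as going from (3,1) to (3,2), even though each is a single step in one score.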