# Labeling test tweets

In this notebook I'm going to create a test data set.
To do so, I take the first 100 tweets of the data set and label them with the Watson Natural Language Understanding API.

- **Input**:
The input is the data frame loaded from the CSV file.

- **Output**:
username, date, retweets, favorites, text, **emotion**

# Emotion classification function

In [263]:
# function to classify the emotion of a text (one tweet)
def emotion_classification(text):
    """
    This function runs the Watson Natural Language Understanding API
    
    Input:
    - text: the text which is going to be classified into a specific emotion
    
    Output:
    - dict with the full API response
    -- most important: the emotion of the document
    """
    response = natural_language_understanding.analyze(
        text=text,  # input of the function
        features=Features(
            emotion=EmotionOptions(document=True),
            entities=EntitiesOptions(emotion=True, sentiment=True, limit=2),
            keywords=KeywordsOptions(emotion=True, sentiment=True,
                                     limit=2))
    ).get_result()
    return response

# get max emotion

In [248]:
def get_max_emotion(emotions_dict):
    
    # find the max value
    max_probability = max(emotions_dict.values())

    # take the first emotion whose value equals max_probability
    emotion_max_probability = [k for k, v in emotions_dict.items() if v == max_probability][0]

    # return the label and its value
    return emotion_max_probability, max_probability

# drop rows with pictures and links at the beginning of the tweet

In [359]:
def drop_pictures_links(df):
    """
    This function removes the tweets (whole rows) which start with a picture or link.
    Reason: often such a tweet is just the picture or the link, which causes a problem when analyzing the "text"
    
    Input:
    - df: the data frame with all unprocessed data
    
    Output:
    - df_new: df without the rows which start with a picture or link
    - drop_list: list of the row indices which got dropped
    """
    # prefixes that mark a tweet starting with a picture or a link
    start_picture_tweet = "pic.twitter.com"
    start_link_tweet = "https://"

    # init list of all the rows which are going to be dropped
    drop_list = []

    # run through every row and check which rows start with a picture or link
    for i, text in df["text"].items():
        if text.startswith(start_picture_tweet) or text.startswith(start_link_tweet):
            drop_list.append(i)

    # drop rows which start with a picture or link
    df_new = df.drop(drop_list)

    # return the new data frame and the list of dropped rows
    return df_new, drop_list
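
The prefix check at the heart of `drop_pictures_links` can be sketched without pandas. This is a minimal illustration, not the notebook's implementation: it assumes plain Python strings, the helper name `starts_with_picture_or_link` is hypothetical, and the last sample string is made up for the example (the other three are taken from the test texts further down).

```python
# Prefixes used in the notebook to recognise picture and link tweets.
PREFIXES = ("pic.twitter.com", "https://")


def starts_with_picture_or_link(text):
    """Return True if the text begins with one of the prefixes (hypothetical helper)."""
    # str.startswith accepts a tuple and checks every prefix in it.
    return text.startswith(PREFIXES)


tweets = [
    "pic.twitter.com/gsZyUFlTPJ",
    "Not good! https:// twitter.com/Cernovich/stat us/1225971069325955074 …",
    "@LindseyGrahamSC https:// twitter.com/michaelbeatty3 /status/1224003122055372800 …",
    "https://example.com/made-up-link",  # made-up sample, not from the data set
]

# Positions of the tweets that would be dropped, and the tweets that remain.
drop_list = [i for i, text in enumerate(tweets) if starts_with_picture_or_link(text)]
kept = [text for text in tweets if not starts_with_picture_or_link(text)]

print(drop_list)
print(len(kept))
```

Note that a tweet with a link in the middle of the text is kept, which matches the behaviour of the function above: only the beginning of the tweet is inspected.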

# Import of the libraries

In [524]:
# libraries I'm going to use
import pandas as pd
from IPython.display import Image
import os

In [402]:
# load the tweets
data = pd.read_csv("trumptweets.csv", sep=';')

In [403]:
data.head()

Unnamed: 0,username,date,retweets,favorites,text,geo,mentions,hashtags,id,permalink
0,realDonaldTrump,09.02.20 00:47,13459,72445,A great coach and a fantastic guy. His endorse...,,,,"1,22629E+18",https://twitter.com/realDonaldTrump/status/122...
1,realDonaldTrump,08.02.20 22:08,47880,215503,Pete Rose played Major League Baseball for 24 ...,,,,"1,22625E+18",https://twitter.com/realDonaldTrump/status/122...
2,realDonaldTrump,08.02.20 20:48,9452,37402,Total and complete Endorsement for Debbie Lesk...,,#NAME?,,"1,22623E+18",https://twitter.com/realDonaldTrump/status/122...
3,realDonaldTrump,08.02.20 20:40,17545,62484,Governor Cuomo wanted to see me this weekend. ...,,,,"1,22623E+18",https://twitter.com/realDonaldTrump/status/122...
4,realDonaldTrump,08.02.20 20:01,27437,120598,We will not be touching your Social Security o...,,,,"1,22622E+18",https://twitter.com/realDonaldTrump/status/122...


### Data frame relevant information

In [503]:
# data frame with just the relevant information
#data_test = data.iloc[:,0:5]

# create a test set of the first 100 tweets
data_test = data.iloc[0:100,:]
data_test

Unnamed: 0,username,date,retweets,favorites,text,geo,mentions,hashtags,id,permalink
0,realDonaldTrump,09.02.20 00:47,13459,72445,A great coach and a fantastic guy. His endorse...,,,,"1,22629E+18",https://twitter.com/realDonaldTrump/status/122...
1,realDonaldTrump,08.02.20 22:08,47880,215503,Pete Rose played Major League Baseball for 24 ...,,,,"1,22625E+18",https://twitter.com/realDonaldTrump/status/122...
2,realDonaldTrump,08.02.20 20:48,9452,37402,Total and complete Endorsement for Debbie Lesk...,,#NAME?,,"1,22623E+18",https://twitter.com/realDonaldTrump/status/122...
3,realDonaldTrump,08.02.20 20:40,17545,62484,Governor Cuomo wanted to see me this weekend. ...,,,,"1,22623E+18",https://twitter.com/realDonaldTrump/status/122...
4,realDonaldTrump,08.02.20 20:01,27437,120598,We will not be touching your Social Security o...,,,,"1,22622E+18",https://twitter.com/realDonaldTrump/status/122...
...,...,...,...,...,...,...,...,...,...,...
95,realDonaldTrump,31.01.20 03:54,24418,100133,"This November, we are going to defeat the Radi...",,,#KAG2020,"1,22308E+18",https://twitter.com/realDonaldTrump/status/122...
96,realDonaldTrump,31.01.20 03:48,19539,88388,"Thank you Iowa, I love you! https://www. pscp....",,,,"1,22308E+18",https://twitter.com/realDonaldTrump/status/122...
97,realDonaldTrump,31.01.20 01:19,18878,73498,"Great poll in Iowa, where I just landed for a ...",,,#KAG2020,"1,22304E+18",https://twitter.com/realDonaldTrump/status/122...
98,realDonaldTrump,30.01.20 23:04,22964,138237,Working closely with China and others on Coron...,,,,"1,223E+18",https://twitter.com/realDonaldTrump/status/122...


# Create data frame with emotion function

In [520]:
def create_emotion(df):
    """
    This function transforms the data frame so that only the important information is kept.
    It also adds the emotion and the probability of the emotion as columns.
    
    Input:
    - df: the data frame as we get it from the GitHub function
    
    Output:
    - df_new: data frame with the important information, plus emotion and emotion probability
    """
    
    # transformation of the data frame
    # data frame with just the relevant information (username, date, retweets, favorites, text)
    trump_tweets = df.iloc[:,0:5]

    # drop rows of the data frame which start with a picture or link
    trump_tweets, rowindex_dropped = drop_pictures_links(trump_tweets) # --- function
    # reindex data frame - easier to run the loop
    trump_tweets = trump_tweets.reset_index()
    # init lists for the emotions and their probabilities - they become the new columns
    emotions_list = []
    emotions_prob_list = []
    
    # loop over every sentence in the data frame
    for i in range(trump_tweets.shape[0]):

        # run the function with Watson API 
        emotions = emotion_classification(trump_tweets.text[i]) # --- function

        ### --- flag - start ---
        # check whether the response contains an emotion
        # else classify with "emotionless", 0.0
        if "emotion" in emotions:
        ### --- flag - end ---

            # access just the emotion of the document (of the tweet)
            emotions_dict = emotions["emotion"]["document"]["emotion"]

            # get the max emotion value
            # get_max_emotion(emotions_dict) 
            emotion_max_probability,max_probability = get_max_emotion(emotions_dict) # --- function 

            # save results for each step
            emotions_list.append(emotion_max_probability)
            emotions_prob_list.append(max_probability)

        # classify emotion with "emotionless"
        else:
            emotions_list.append("emotionless")
            emotions_prob_list.append(0.0)
            
    # assign emotions and their probabilities to the data frame
    trump_tweets = trump_tweets.assign(emotion = emotions_list)
    trump_tweets = trump_tweets.assign(emotion_probability = emotions_prob_list)
    
    # return new data frame
    return trump_tweets
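
The per-tweet step inside the loop (read the document emotion, pick the strongest one, fall back to "emotionless") can be tried offline on a response dict. This is only a sketch under assumptions: `label_from_response` is a hypothetical helper that does not exist in the notebook, and the sample response is a shortened copy of the output shown for `test_text12` further down.

```python
def label_from_response(response):
    """Return (emotion, probability) for one API response dict (hypothetical helper)."""
    # The document-level scores live under emotion -> document -> emotion.
    scores = response.get("emotion", {}).get("document", {}).get("emotion")
    if not scores:
        # Same fallback as in create_emotion: no emotion in the response.
        return "emotionless", 0.0
    # Pick the emotion with the highest score (first one wins on ties).
    label = max(scores, key=scores.get)
    return label, scores[label]


# Shortened copy of the response printed for test_text12.
sample_response = {
    "language": "en",
    "emotion": {"document": {"emotion": {
        "sadness": 0.153904,
        "joy": 0.079845,
        "fear": 0.074985,
        "disgust": 0.024289,
        "anger": 0.150603,
    }}},
}

with_emotion = label_from_response(sample_response)
without_emotion = label_from_response({"language": "en"})
print(with_emotion)
print(without_emotion)
```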

### Watson API

In [11]:
# libraries I need to import
import json

# if import does not work
# pip install --upgrade "ibm-watson>=4.2.1"
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator
from ibm_watson.natural_language_understanding_v1 import Features, EntitiesOptions, KeywordsOptions, EmotionOptions

**API setup**

In [12]:
# API setup
# I saved my credentials in a JSON file
# and open that file here to store the apikey and url in variables,
# since other people should not see my keys
with open('watson_api.json') as json_file:
    # save data in dict
    api_access = json.load(json_file)

# init the variables needed
apikey = api_access["apikey"]
url = api_access["url"]
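
For readers who want to reproduce the setup, here is a small sketch of what such a credentials file could look like and how to check it after loading. The two keys `apikey` and `url` are the ones read above; the values are placeholders, and writing the file to a temporary directory is only done so that the example runs on its own.

```python
import json
import os
import tempfile

# Placeholder credentials: replace both values with the ones from your own service.
placeholder = {
    "apikey": "YOUR_API_KEY",
    "url": "https://example.invalid/your-service-url",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "watson_api.json")

    # Write the file once (normally done by hand, outside of version control).
    with open(path, "w") as json_file:
        json.dump(placeholder, json_file)

    # Load it the same way as in the cell above.
    with open(path) as json_file:
        api_access = json.load(json_file)

# Fail early with a clear message if a key is missing.
missing = [key for key in ("apikey", "url") if key not in api_access]
if missing:
    raise KeyError(f"watson_api.json is missing: {missing}")

apikey = api_access["apikey"]
url = api_access["url"]
```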

**API setting**

In [13]:
# settings for the api 
authenticator = IAMAuthenticator(apikey)
natural_language_understanding = NaturalLanguageUnderstandingV1(
    version='2019-07-12',
    authenticator=authenticator)

natural_language_understanding.set_service_url(url)

**TEST**

In [518]:
# create a test set of the first 100 tweets
data_test = data.iloc[0:100,:]

In [521]:
# create data frame 
data_new = create_emotion(data_test)

In [522]:
# check the data frame
data_new

Unnamed: 0,index,username,date,retweets,favorites,text,emotion,emotion_probability
0,0,realDonaldTrump,09.02.20 00:47,13459,72445,A great coach and a fantastic guy. His endorse...,joy,0.601324
1,1,realDonaldTrump,08.02.20 22:08,47880,215503,Pete Rose played Major League Baseball for 24 ...,sadness,0.320811
2,2,realDonaldTrump,08.02.20 20:48,9452,37402,Total and complete Endorsement for Debbie Lesk...,joy,0.453622
3,3,realDonaldTrump,08.02.20 20:40,17545,62484,Governor Cuomo wanted to see me this weekend. ...,sadness,0.462204
4,4,realDonaldTrump,08.02.20 20:01,27437,120598,We will not be touching your Social Security o...,anger,0.648076
...,...,...,...,...,...,...,...,...
86,95,realDonaldTrump,31.01.20 03:54,24418,100133,"This November, we are going to defeat the Radi...",joy,0.661953
87,96,realDonaldTrump,31.01.20 03:48,19539,88388,"Thank you Iowa, I love you! https://www. pscp....",joy,0.809158
88,97,realDonaldTrump,31.01.20 01:19,18878,73498,"Great poll in Iowa, where I just landed for a ...",joy,0.584979
89,98,realDonaldTrump,30.01.20 23:04,22964,138237,Working closely with China and others on Coron...,joy,0.628873


In [532]:
# save the file in the current working directory
data_new.to_csv('LabeledTweets.csv', sep=';')

## Start - check error message

In [285]:
# since the system has a problem producing a test set from the first 100 rows,
# I'm going to print the first 100 tweets to figure out what the problem could be
for i in range(100):
    print("---")
    print(i)
    print(trump_tweets.text[i])

---
0
A great coach and a fantastic guy. His endorsement of me in Indiana was a very big deal! https:// twitter.com/kyle__boone/st atus/1226234981808250880 …
---
1
Pete Rose played Major League Baseball for 24 seasons, from 1963-1986, and had more hits, 4,256, than any other player (by a wide margin). He gambled, but only on his own team winning, and paid a decades long price. GET PETE ROSE INTO THE BASEBALL HALL OF FAME. It’s Time!
---
2
Total and complete Endorsement for Debbie Lesko! @RepDLesko Love Arizona. https:// twitter.com/repdlesko/stat us/1225484090754899969 …
---
3
Governor Cuomo wanted to see me this weekend. He just canceled. Very hard to work with New York - So stupid. All they do is sue me all the time! https:// twitter.com/RepStefanik/st atus/1225494053913079808 …
---
4
We will not be touching your Social Security or Medicare in Fiscal 2021 Budget. Only the Democrats will destroy them by destroying our Country’s greatest ever Economy!
---
5
...the worst weeks ever.” Sh

In [295]:
test_text13 = "pic.twitter.com/gsZyUFlTPJ"
test_text12 = "Not good! https:// twitter.com/Cernovich/stat us/1225971069325955074 …"
test_text38 ="https://www. pscp.tv/w/cQxYyDFvTlFs TFJub1dwUXd8MWxQSnFWTm5yRW14YklFJuJ_tJ0v4Udv_WOByjdcO6UdLv7wH_KJbs45m-wZ?t=7s …"
test_text64 = "@LindseyGrahamSC https:// twitter.com/michaelbeatty3 /status/1224003122055372800 …"


In [289]:
# test first test text
emotions_test = emotion_classification(test_text13)

ERROR:root:unsupported text language: pl
Traceback (most recent call last):
  File "/Users/phillipholscher/opt/anaconda3/lib/python3.7/site-packages/ibm_cloud_sdk_core/base_service.py", line 229, in send
    response.status_code, error_message, http_response=response)
ibm_cloud_sdk_core.api_exception.ApiException: Error: unsupported text language: pl, Code: 400 , X-global-transaction-id: 2699e6705aeb9e6e618afdc8926883e3


ApiException: Error: unsupported text language: pl, Code: 400 , X-global-transaction-id: 2699e6705aeb9e6e618afdc8926883e3

In [291]:
# test second test text
emotions_test2 = emotion_classification(test_text12)

In [292]:
emotions_test2

{'usage': {'text_units': 1, 'text_characters': 70, 'features': 3},
 'language': 'en',
 'keywords': [{'text': 'https',
   'sentiment': {'score': 0, 'label': 'neutral'},
   'relevance': 0.916021,
   'emotion': {'sadness': 0.201746,
    'joy': 0.053311,
    'fear': 0.137107,
    'disgust': 0.330373,
    'anger': 0.038904},
   'count': 1},
  {'text': 'stat',
   'sentiment': {'score': 0, 'label': 'neutral'},
   'relevance': 0.71226,
   'emotion': {'sadness': 0.201746,
    'joy': 0.053311,
    'fear': 0.137107,
    'disgust': 0.330373,
    'anger': 0.038904},
   'count': 1}],
 'entities': [],
 'emotion': {'document': {'emotion': {'sadness': 0.153904,
    'joy': 0.079845,
    'fear': 0.074985,
    'disgust': 0.024289,
    'anger': 0.150603}}}}

In [294]:
# test third test text
emotions_test3 = emotion_classification(test_text38)

ERROR:root:unsupported text language: unknown
Traceback (most recent call last):
  File "/Users/phillipholscher/opt/anaconda3/lib/python3.7/site-packages/ibm_cloud_sdk_core/base_service.py", line 229, in send
    response.status_code, error_message, http_response=response)
ibm_cloud_sdk_core.api_exception.ApiException: Error: unsupported text language: unknown, Code: 400 , X-global-transaction-id: 28cd565dffff2a60519649983ccf67ae


ApiException: Error: unsupported text language: unknown, Code: 400 , X-global-transaction-id: 28cd565dffff2a60519649983ccf67ae

In [296]:
# test fourth test text
emotions_test4 = emotion_classification(test_text64)

In [297]:
emotions_test4

{'usage': {'text_units': 1, 'text_characters': 82, 'features': 3},
 'language': 'en',
 'keywords': [{'text': 'https',
   'sentiment': {'score': 0, 'label': 'neutral'},
   'relevance': 0.937512,
   'emotion': {'sadness': 0.113138,
    'joy': 0.02563,
    'fear': 0.123733,
    'disgust': 0.280839,
    'anger': 0.037328},
   'count': 1},
  {'text': 'status',
   'sentiment': {'score': 0, 'label': 'neutral'},
   'relevance': 0.690082,
   'emotion': {'sadness': 0.113138,
    'joy': 0.02563,
    'fear': 0.123733,
    'disgust': 0.280839,
    'anger': 0.037328},
   'count': 1}],
 'entities': [{'type': 'TwitterHandle',
   'text': '@LindseyGrahamSC',
   'sentiment': {'score': 0, 'label': 'neutral'},
   'relevance': 0.978348,
   'emotion': {'sadness': 0.113138,
    'joy': 0.02563,
    'fear': 0.123733,
    'disgust': 0.280839,
    'anger': 0.037328},
   'count': 1,
   'confidence': 0.8}],
 'emotion': {'document': {'emotion': {'sadness': 0.113138,
    'joy': 0.02563,
    'fear': 0.123733,
    'dis

# The problem

- **Problem**: Just picture or link posts

If a **post** consists of just a **picture** or a **link**, the IBM Watson API cannot detect the "language" of the text and returns an error (`unsupported text language`).

- **Solution**: I'm going to drop all the tweets in the data frame which are just pictures or links!

Ignoring these tweets is one way to solve the problem. 
Since this project only deals with text analysis, this is acceptable in this context. 
It would also be possible to classify these tweets into an emotion, but that would increase the complexity of the project.
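
A complementary safeguard, sketched here as an assumption rather than as part of the notebook, is to catch the API error for a single tweet so that one unsupported text does not abort the whole loop. `safe_classification` and `fake_classification` are hypothetical names. The fake function only imitates the failure seen above so the example runs offline, and its score is a placeholder. In the notebook one would pass `emotion_classification` instead and would probably catch the `ApiException` named in the traceback rather than the generic `Exception`.

```python
def safe_classification(classify, text):
    """Call classify(text); return an empty dict if the call fails (hypothetical helper)."""
    try:
        return classify(text)
    except Exception as error:  # assumption: narrow this to ApiException in the notebook
        print(f"skipped {text!r}: {error}")
        # An empty dict has no "emotion" key, so create_emotion's
        # existing check would label the tweet as "emotionless".
        return {}


def fake_classification(text):
    """Offline stand-in for the Watson call; the returned score is a placeholder."""
    if text.startswith(("pic.twitter.com", "https://")):
        raise ValueError("unsupported text language: unknown")
    return {"emotion": {"document": {"emotion": {"joy": 0.5}}}}


texts = ["pic.twitter.com/gsZyUFlTPJ", "Thank you Iowa, I love you!"]
results = [safe_classification(fake_classification, text) for text in texts]
print(["emotion" in result for result in results])
```

Dropping the rows up front, as done below, avoids the failing API calls altogether. The wrapper only protects against cases the prefix check does not catch.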

In [314]:
# I'm going to drop all the tweets from the data frame which start with a picture or link

# create a test set of the first 100 tweets
data_test = trump_tweets.iloc[0:100,:]
data_test

Unnamed: 0,username,date,retweets,favorites,text
0,realDonaldTrump,09.02.20 00:47,13459,72445,A great coach and a fantastic guy. His endorse...
1,realDonaldTrump,08.02.20 22:08,47880,215503,Pete Rose played Major League Baseball for 24 ...
2,realDonaldTrump,08.02.20 20:48,9452,37402,Total and complete Endorsement for Debbie Lesk...
3,realDonaldTrump,08.02.20 20:40,17545,62484,Governor Cuomo wanted to see me this weekend. ...
4,realDonaldTrump,08.02.20 20:01,27437,120598,We will not be touching your Social Security o...
...,...,...,...,...,...
95,realDonaldTrump,31.01.20 03:54,24418,100133,"This November, we are going to defeat the Radi..."
96,realDonaldTrump,31.01.20 03:48,19539,88388,"Thank you Iowa, I love you! https://www. pscp...."
97,realDonaldTrump,31.01.20 01:19,18878,73498,"Great poll in Iowa, where I just landed for a ..."
98,realDonaldTrump,30.01.20 23:04,22964,138237,Working closely with China and others on Coron...


In [315]:
data_test_new, drop_list = drop_pictures_links(data_test)
data_test_new

Unnamed: 0,username,date,retweets,favorites,text
0,realDonaldTrump,09.02.20 00:47,13459,72445,A great coach and a fantastic guy. His endorse...
1,realDonaldTrump,08.02.20 22:08,47880,215503,Pete Rose played Major League Baseball for 24 ...
2,realDonaldTrump,08.02.20 20:48,9452,37402,Total and complete Endorsement for Debbie Lesk...
3,realDonaldTrump,08.02.20 20:40,17545,62484,Governor Cuomo wanted to see me this weekend. ...
4,realDonaldTrump,08.02.20 20:01,27437,120598,We will not be touching your Social Security o...
...,...,...,...,...,...
95,realDonaldTrump,31.01.20 03:54,24418,100133,"This November, we are going to defeat the Radi..."
96,realDonaldTrump,31.01.20 03:48,19539,88388,"Thank you Iowa, I love you! https://www. pscp...."
97,realDonaldTrump,31.01.20 01:19,18878,73498,"Great poll in Iowa, where I just landed for a ..."
98,realDonaldTrump,30.01.20 23:04,22964,138237,Working closely with China and others on Coron...


As we can see, rows 13 and 38, the ones that failed the NLU run in my tests, are on the drop list. 
**Remove these rows!**

In [310]:
data_test.drop(drop_list)

Unnamed: 0,username,date,retweets,favorites,text
0,realDonaldTrump,09.02.20 00:47,13459,72445,A great coach and a fantastic guy. His endorse...
1,realDonaldTrump,08.02.20 22:08,47880,215503,Pete Rose played Major League Baseball for 24 ...
2,realDonaldTrump,08.02.20 20:48,9452,37402,Total and complete Endorsement for Debbie Lesk...
3,realDonaldTrump,08.02.20 20:40,17545,62484,Governor Cuomo wanted to see me this weekend. ...
4,realDonaldTrump,08.02.20 20:01,27437,120598,We will not be touching your Social Security o...
...,...,...,...,...,...
95,realDonaldTrump,31.01.20 03:54,24418,100133,"This November, we are going to defeat the Radi..."
96,realDonaldTrump,31.01.20 03:48,19539,88388,"Thank you Iowa, I love you! https://www. pscp...."
97,realDonaldTrump,31.01.20 01:19,18878,73498,"Great poll in Iowa, where I just landed for a ..."
98,realDonaldTrump,30.01.20 23:04,22964,138237,Working closely with China and others on Coron...


## End - check error message