# Project - Tim Thoma (Matr.Nr. 75597)
## Generating Datasets with a Twitter API
This project is divided into three main Jupyter notebooks. The first notebook ("01_project_NLP_Dataset") contains an overview of the Twitter APIs and shows how to generate a dataset with them. The second and third notebooks ("02_project_NLP_nltk" and "03_project_NLP_spacy") form the main part of the NLP project. In these two notebooks, NLP techniques such as tokenization, lemmatization, stop word removal and polarity classification are applied to the generated Twitter datasets. In addition, the results of these techniques are displayed graphically.

### Introduction

Tweepy is a Python library for accessing the Twitter API. In this project it is used to access the Streaming API and the Search API. These APIs make it possible to download Tweets and metadata from Twitter users.

#### Streaming API
The Streaming API delivers real-time data from Twitter. Several parameters can be used to restrict the collected data, for example the location of the Tweets. In this project the Streaming API was used to collect real-time Tweets from the German cities Munich, Frankfurt, Berlin and Cologne. The data was collected over four hours.
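
As a rough illustration of the location parameter: the streaming filter expects bounding boxes as one flat list of longitude/latitude pairs, with the south-west corner of each box first (in Tweepy this list is passed as the `locations` argument of the stream's `filter` method). The sketch below only builds such a list. The coordinates and the helper names `to_locations` and `city_of` are assumptions made for this example, not the values or functions used in the project.

```python
# Approximate bounding boxes as (west_lon, south_lat, east_lon, north_lat).
# These coordinates are illustrative assumptions, not the project's values.
CITY_BOXES = {
    "Munich": (11.36, 48.06, 11.72, 48.25),
    "Frankfurt": (8.47, 50.02, 8.80, 50.23),
    "Berlin": (13.09, 52.34, 13.76, 52.68),
    "Cologne": (6.77, 50.83, 7.16, 51.08),
}


def to_locations(boxes):
    """Flatten the boxes into one list: south-west lon/lat, then north-east lon/lat."""
    locations = []
    for west, south, east, north in boxes.values():
        locations.extend([west, south, east, north])
    return locations


def city_of(lon, lat, boxes=CITY_BOXES):
    """Return the first city whose box contains the point, or None."""
    for city, (west, south, east, north) in boxes.items():
        if west <= lon <= east and south <= lat <= north:
            return city
    return None


locations = to_locations(CITY_BOXES)
print(len(locations))         # 16 numbers: four boxes with four values each
print(city_of(13.40, 52.52))  # Berlin
```

A helper like `city_of` could also be used afterwards to label each collected Tweet with the city its coordinates fall into.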

The first step after collecting the dataset is to look at the data. The quality of the Tweets was poor: they contain many different languages, a lot of colloquial language and a lot of unrelated information. After hours of trying to clean the data, it turned out that no useful information could be extracted from this dataset. Therefore this approach is not analysed further. Nevertheless, the generated dataset is available. NLP tasks might be feasible on a much larger dataset, but such a dataset would have to be collected over several days.

#### Search API
With the Search API, Twitter makes it possible to collect up to 3200 Tweets from a single user. To generate a useful NLP dataset, three well-known Twitter personalities were selected: Donald Trump, Hillary Clinton and Bernie Sanders. All three have an active Twitter account and are related to each other through the 2016 United States presidential election. These properties make their Tweets well suited for NLP analysis and comparison.


##### Important to know
Using the Twitter APIs requires an individual key. Twitter sends this key by e-mail, and it is for personal use only. The key is therefore hidden in the following code example. Anyone who wants to run this code to generate the dataset must use their own key. However, there is no need to run the code, because all of the generated data is provided in the project folder.

# Search API
- access to Tweets that have already been published (historical data)
- at most the last 3,200 Tweets per user
- 180 requests per 15-minute period
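
A quick back-of-the-envelope check of these limits, as a sketch only: the page size of 200 Tweets per request is an assumption made for this example, while the other numbers are the limits listed above.

```python
import math

MAX_TWEETS_PER_USER = 3200  # limit listed above
REQUESTS_PER_WINDOW = 180   # limit listed above, per 15-minute window
TWEETS_PER_REQUEST = 200    # assumption: page size of a single timeline request
USERS = 3                   # Trump, Clinton, Sanders

requests_per_user = math.ceil(MAX_TWEETS_PER_USER / TWEETS_PER_REQUEST)
total_requests = requests_per_user * USERS
windows_needed = math.ceil(total_requests / REQUESTS_PER_WINDOW)

print(requests_per_user, total_requests, windows_needed)  # 16 48 1
```

Under these assumptions, all three timelines fit into a single 15-minute window.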

In [None]:
## Connect with your Twitter Development App

### Twitter API ###
import tweepy
import twitter

api = twitter.Api(consumer_key='insert_key_here',
                  consumer_secret='insert_key_here',
                  access_token_key='insert_key_here',
                  access_token_secret='insert_key_here')

In [None]:
#### Function for collecting Data with specific usernames ####

#number of tweets (remember: max.3200 tweets per user)

#define username
trump = "@realDonaldTrump"
clinton = "@HillaryClinton"
sanders = "@BernieSanders"

### Uncomment the next line to start generating the dataset (takes a long time)
#username = trump
###

## 
import csv
import sys
import time  # required for time.sleep() in limit_handled
import tweepy
import numpy as np
import pandas as pd
##

# Set up
auth = tweepy.OAuthHandler('insert_key_here', 
                                'insert_key_here')# consumer_key, consumer_secret
auth.set_access_token('insert_key_here',
                          'insert_key_here') # access_key, access_key_secret
# Calling Search API via library tweepy
api = tweepy.API(auth)

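# Generator that waits out the rate limit (180 requests per 15 min) and then continues
def limit_handled(cursor):
    while True:
        try:
            yield cursor.next()
        except StopIteration:
            # No more Tweets available
            return
        except tweepy.RateLimitError:  # tweepy 3.x; renamed to tweepy.TooManyRequests in 4.x
            time.sleep(15 * 60)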
            
#define what objects should be stored 

df = []
# count=200 is the largest page size the API returns per request; the cursor pages through the rest
for tweet in limit_handled(tweepy.Cursor(api.user_timeline, screen_name=username, count=200).items()):
    df.append([username, tweet.id_str, tweet.source, tweet.created_at,
               tweet.retweet_count, tweet.favorite_count, tweet.text.encode('utf-8')])
    
# Write all collected rows to a CSV file
filename = username + "_tweets.csv"
with open(filename, "w", newline="") as file:
    writer = csv.writer(file, delimiter=',')
    writer.writerow(['User', 'Tweet_ID', 'Source', 'Created_at', 'Retweet_count', 'Favorite_count', 'Text'])
    writer.writerows(df)
    

### Take a look at the generated dataset

In [1]:
#Trump
import pandas as pd

df_trump = pd.read_csv('@realDonaldTrump_tweets.csv')
df_trump

| Unnamed: 0 | User | Tweet_ID | Source | Created_at | Retweet_count | Favorite_count | Text |
|---|---|---|---|---|---|---|---|
| 0 | @realDonaldTrump | 1138445389709885445 | Twitter for iPhone | 2019-06-11 13:58:26 | 3622 | 14775 | b'\xe2\x80\x9cWhy did the Democrats run if the... |
| 1 | @realDonaldTrump | 1138444530020245505 | Twitter for iPhone | 2019-06-11 13:55:01 | 3892 | 16524 | b'Good day in the Stock Market. People have no... |
| 2 | @realDonaldTrump | 1138434259637690375 | Twitter for iPhone | 2019-06-11 13:14:13 | 4542 | 18706 | b'....If Mexico produces (which I think they w... |
| 3 | @realDonaldTrump | 1138433774327410688 | Twitter for iPhone | 2019-06-11 13:12:17 | 4247 | 18157 | b'...Companies to come to the U.S.A and to get... |
| 4 | @realDonaldTrump | 1138428346214363142 | Twitter for iPhone | 2019-06-11 12:50:43 | 6772 | 27314 | b'Maria, Dagan, Steve, Stuart V - When you are... |
| 5 | @realDonaldTrump | 1138427450927529984 | Twitter for iPhone | 2019-06-11 12:47:09 | 4149 | 14929 | b'This is because the Euro and other currencie... |
| 6 | @realDonaldTrump | 1138418208803831808 | Twitter for iPhone | 2019-06-11 12:10:26 | 7647 | 37786 | b'The United States has VERY LOW INFLATION, a ... |
| 7 | @realDonaldTrump | 1138414246298042369 | Twitter for iPhone | 2019-06-11 11:54:41 | 9714 | 36558 | b'Sad when you think about it, but Mexico righ... |
| 8 | @realDonaldTrump | 1138413497858101249 | Twitter for iPhone | 2019-06-11 11:51:43 | 10496 | 41680 | b'PRESIDENTIAL HARASSMENT!' |
| 9 | @realDonaldTrump | 1138412973649747968 | Twitter for iPhone | 2019-06-11 11:49:38 | 7470 | 31664 | b'....investigation in the Senate. They can ta... |
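
The `Text` column in the output above holds the string form of Python bytes objects (`b'...'`), because each Tweet was encoded to UTF-8 before being written to the CSV file. One possible way to turn such values back into readable text is sketched below. The helper name `decode_bytes_literal` is hypothetical and not part of the project code.

```python
import ast


def decode_bytes_literal(value):
    """Turn a string such as "b'...'" back into readable text.

    Values that are not a complete bytes literal are returned unchanged.
    """
    if isinstance(value, str) and value[:2] in ("b'", 'b"'):
        try:
            # literal_eval safely parses the literal into a bytes object
            return ast.literal_eval(value).decode("utf-8")
        except (ValueError, SyntaxError, UnicodeDecodeError):
            return value
    return value


print(decode_bytes_literal("b'PRESIDENTIAL HARASSMENT!'"))  # PRESIDENTIAL HARASSMENT!
```

With pandas, the helper could be applied to a whole column, for example `df_trump['Text'] = df_trump['Text'].apply(decode_bytes_literal)`.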


In [None]:
#Clinton
import pandas as pd

df_clinton = pd.read_csv('@HillaryClinton_tweets.csv')
df_clinton

In [None]:
#Sanders
import pandas as pd

df_sanders = pd.read_csv('@BernieSanders_tweets.csv')
df_sanders

## Main task - NLP

These three generated Twitter datasets are used in the following two Jupyter notebooks. The data is analysed with two popular NLP packages, **NLTK** and **spaCy**. Using two different packages makes it possible to compare their results. Furthermore, the output of the two packages, i.e. the cleaned data, is **visualized** and additionally examined using **sentiment analysis**.
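
To give a first impression of what cleaning a Tweet means, here is a minimal plain-Python sketch of tokenization and stop word removal. It is not the NLTK or spaCy pipeline used in the following notebooks, and the stop word list is a deliberately tiny, hypothetical one.

```python
import re

# Hypothetical, deliberately tiny stop word list; NLTK and spaCy ship much larger ones.
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]


tokens = remove_stop_words(tokenize("Good day in the Stock Market."))
print(tokens)  # ['good', 'day', 'stock', 'market']
```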

Link to [Notebook 2 - NLTK](02_project_NLP_nltk.ipynb)

Link to [Notebook 3 - Spacy](03_project_NLP_spacy.ipynb)

#### Credits:
https://developer.twitter.com/en/docs.html

https://spacy.io

https://www.nltk.org

https://textblob.readthedocs.io/en/dev/

https://github.com/cjhutto/vaderSentiment