## Progress Report 1

This notebook displays statistics on a limited set of Twitter data stored as JSON files. I clean the dataset and place the relevant entries into a pandas DataFrame. All data is taken from ArchiveTeam's [ongoing Twitter stream](https://archive.org/search.php?query=twitterstream&sort=-publicdate&page=2) under CC0.

### Set Up

This portion of the code loads all necessary libraries and sets up formatting options such as pretty printing and the interactive shell setting that lets me print multiple outputs per cell.

In [1]:
# Libraries
import tweepy
import pandas as pd
import matplotlib.pyplot as plt
import json
import os
import re
from pandas.io.json import json_normalize

In [2]:
# Formatting
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
%pprint

Pretty printing has been turned OFF


In [14]:
# Creating keys and auth
consumerKey = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
consumerSecret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
access_token = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'
access_secret = 'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'

auth = tweepy.OAuthHandler(consumer_key=consumerKey, 
    consumer_secret=consumerSecret)

api = tweepy.API(auth)

### Importing Data

This section imports the relevant data, which is from late September 2011.

In [4]:
# lists to store file snippets
d = [] # dir snippet
fn = [] # full file name snippets
i = 0 # iter

# populates the list with directory names
for root, dirs, files in os.walk(r'D:\Documents\Classes\Spring2020\ling1340\Twitter-Positivity-Analysis\data\27'):
    for filename in dirs:
        d.append(filename)

In [5]:
d

['19', '20', '21', '22', '23']

These numbered names are the subfolders of my data folder that hold the JSON files.

In [6]:
# This loop cycles through each directory and populates the fn list with "directory\filename" entries,
# skipping the bz2 archives that sit in the same folders as the JSON files
for dirname in d:
    fnd = r'D:\Documents\Classes\Spring2020\ling1340\Twitter-Positivity-Analysis\data\27' + "\\" + dirname
    for root, dirs, files in os.walk(fnd):
        for filename in files:
            if "bz2" not in filename: # keep only the files that are not compressed archives
                fn.append(dirname + "\\" + filename)

In [7]:
fn[:10]

['19\\48.json', '20\\35.json', '20\\36.json', '20\\37.json', '20\\38.json', '20\\39.json', '20\\40.json', '20\\41.json', '20\\42.json', '20\\43.json']

Storing the directory and filename together in the fn list will make it easier to loop through and read all the files later.

### Cleaning Data

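This section defines the regular expressions and helper functions used to tokenize tweet text. Emoticons, HTML tags, @-mentions, hashtags, and URLs are each matched as a single token, so that extraneous tags and emoticons are easy to identify and strip later.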

In [8]:
emoticons_str = r"""
    (?:
        [:=;] 
        [oO\-]? 
        [D\)\]\(\]/\\OpP] 
    )"""
 
regex_str = [
    emoticons_str,
    r'<[^>]+>', # HTML tags
    r'(?:@[\w_]+)', # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)", # hash-tags
    r'http[s]?://(?:[a-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-f][0-9a-f]))+', # URLs
 
    r'(?:(?:\d+,?)+(?:\.?\d+)?)', # numbers
    r"(?:[a-z][a-z'\-_]+[a-z])", # words with - and '
    r'(?:[\w_]+)', # other words
    r'(?:\S)' # anything else
]
    
tokens_re = re.compile(r'('+'|'.join(regex_str)+')', re.VERBOSE | re.IGNORECASE)
emoticon_re = re.compile(r'^'+emoticons_str+'$', re.VERBOSE | re.IGNORECASE)

In [9]:
# methods to clean text
def tokenize(s):
    return tokens_re.findall(s)
 
def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        tokens = [token if emoticon_re.search(token) else token.lower() for token in tokens]
    return tokens
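
To check how the tokenizer treats a tweet, here is a small self-contained sketch. It uses a trimmed-down copy of the patterns above: the URL pattern is simplified and the HTML and number patterns are dropped, which is my own shortcut for brevity. The sample tweet and its username are made up.

```python
import re

# Trimmed-down copy of the notebook's patterns (assumption: only the
# alternatives needed for this sample are kept, and the URL rule is simplified).
emoticons_str = r"(?:[:=;][oO\-]?[D\)\]\(\]/\\OpP])"
regex_str = [
    emoticons_str,
    r"(?:@[\w_]+)",                    # @-mentions
    r"(?:\#+[\w_]+[\w\'_\-]*[\w_]+)",  # hashtags
    r"http[s]?://\S+",                 # URLs (simplified)
    r"(?:[\w_]+)",                     # words
    r"(?:\S)",                         # anything else
]
tokens_re = re.compile("(" + "|".join(regex_str) + ")", re.IGNORECASE)
emoticon_re = re.compile("^" + emoticons_str + "$", re.IGNORECASE)


def tokenize(s):
    return tokens_re.findall(s)


def preprocess(s, lowercase=False):
    tokens = tokenize(s)
    if lowercase:
        # Lowercase everything except emoticons, whose case carries meaning
        tokens = [t if emoticon_re.search(t) else t.lower() for t in tokens]
    return tokens


# Made-up sample tweet
sample = "RT @example_user: just an example! :D http://example.com #NLP"
tokens = preprocess(sample)
lowered = preprocess(sample, lowercase=True)
print(tokens)
print(lowered)
```

The mention, the emoticon, the URL, and the hashtag each come back as one token, and with `lowercase=True` the emoticon `:D` keeps its capital letter while the other tokens are lowercased.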

### Organizing Data 

This section places the data into DataFrames for further analysis. It also trims those DataFrames to include only relevant information such as text body, date, and language.

In [10]:
base = r'D:\Documents\Classes\Spring2020\ling1340\Twitter-Positivity-Analysis\data\27'
tweets = [] # parsed tweets, one dict per line of JSON

# using the fn list we created above
for name in fn[:5]: # fn[:5] limits the run to a sample; use fn for the full corpus
    file = base + "\\" + name
    with open(file, 'r') as f:
        for line in f:
            tweets.append(json.loads(line)) # loads each line of JSON, which is a tweet

# normalizes all the tweets into one DataFrame in a single call,
# which avoids copying df again for every tweet
df = json_normalize(tweets)

I have only parsed through a sample of the total data because the full amount takes too long to parse during each run. 

Here is the unedited DataFrame:

In [11]:
df
len(df)

Unnamed: 0,in_reply_to_status_id,text,in_reply_to_screen_name,truncated,retweeted,in_reply_to_status_id_str,source,created_at,in_reply_to_user_id_str,geo,...,geo.type,geo.coordinates,coordinates.type,coordinates.coordinates,place.bounding_box,retweeted_status.geo.type,retweeted_status.geo.coordinates,retweeted_status.coordinates.type,retweeted_status.coordinates.coordinates,retweeted_status.entities.media
0,,"@firawidya iya fir,hihi. Rencananya awal bulan...",firawidya,False,False,,"<a href=""http://ubersocial.com"" rel=""nofollow""...",Wed Sep 28 01:48:17 +0000 2011,60311908,,...,,,,,,,,,,
1,,"voo sair aki sem maldade, to com mt sono ; #fato",,False,False,,web,Wed Sep 28 01:48:17 +0000 2011,,,...,,,,,,,,,,
2,,"RT @TheNoteboook: If he loves you, he'll care ...",,True,False,,"<a href=""http://twitter.com/#!/download/iphone...",Wed Sep 28 01:48:17 +0000 2011,,,...,,,,,,,,,,
3,,Ombbb @ddlavato singing on abc!,,False,False,,"<a href=""http://twitter.com/devices"" rel=""nofo...",Wed Sep 28 01:48:17 +0000 2011,,,...,,,,,,,,,,
4,,RT @aniserra: Nunca voy a entender como sabe e...,,False,False,,web,Wed Sep 28 01:48:15 +0000 2011,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3691,118877014584328193,@Agosbonelli a full!!! Después empiezan las ve...,Agosbonelli,False,False,118877014584328193,"<a href=""http://blackberry.com/twitter"" rel=""n...",Wed Sep 28 02:38:59 +0000 2011,255577123,,...,,,,,,,,,,
3692,118876461015900162,"@hay_ross ahaha, so am I, but we have our reas...",hay_ross,False,False,118876461015900162,"<a href=""http://twitter.com/download/android"" ...",Wed Sep 28 02:38:59 +0000 2011,325945966,,...,,,,,,,,,,
3693,118875695173734400,@densukefukusima つまり、プロローグを消し、回想シーンを冒頭にすることで一工...,densukefukusima,False,False,118875695173734400,web,Wed Sep 28 02:38:59 +0000 2011,87141240,,...,,,,,,,,,,
3694,,«@Socialite_Trina ♪ Rolex/ More sex/good weed/...,,False,False,,"<a href=""http://levelupstudio.com"" rel=""nofoll...",Wed Sep 28 02:38:59 +0000 2011,,,...,,,,,,,,,,


3696

I'm happy to see replies and interactions with other people, because I think that kind of data will speak to how people treated each other online (which can sometimes be where nastiness comes through), but I can see some problems.

As you can see, there is a lot of extraneous information here. For my project, a diachronic sentiment analysis of tweet content, the bare minimum I need is the text and the date.

In [12]:
df = df[['text','user.lang', 'created_at']]
df
len(df)
df['user.lang'].value_counts()

Unnamed: 0,text,user.lang,created_at
0,"@firawidya iya fir,hihi. Rencananya awal bulan...",en,Wed Sep 28 01:48:17 +0000 2011
1,"voo sair aki sem maldade, to com mt sono ; #fato",pt,Wed Sep 28 01:48:17 +0000 2011
2,"RT @TheNoteboook: If he loves you, he'll care ...",en,Wed Sep 28 01:48:17 +0000 2011
3,Ombbb @ddlavato singing on abc!,en,Wed Sep 28 01:48:17 +0000 2011
4,RT @aniserra: Nunca voy a entender como sabe e...,es,Wed Sep 28 01:48:15 +0000 2011
...,...,...,...
3691,@Agosbonelli a full!!! Después empiezan las ve...,es,Wed Sep 28 02:38:59 +0000 2011
3692,"@hay_ross ahaha, so am I, but we have our reas...",en,Wed Sep 28 02:38:59 +0000 2011
3693,@densukefukusima つまり、プロローグを消し、回想シーンを冒頭にすることで一工...,ja,Wed Sep 28 02:38:59 +0000 2011
3694,«@Socialite_Trina ♪ Rolex/ More sex/good weed/...,en,Wed Sep 28 02:38:59 +0000 2011


3696

en       2532
es        512
ja        303
pt        302
ko         28
id          9
fr          5
ru          2
de          1
zh-cn       1
nl          1
Name: user.lang, dtype: int64

Now the data is restricted to three columns. I also printed the counts of the languages present.
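
To put those counts in proportion, here is a short sketch that converts them to shares of the sample. The numbers are copied by hand from the output above; on the real DataFrame, `df['user.lang'].value_counts(normalize=True)` should give the same proportions directly.

```python
from collections import Counter

# Counts copied from the value_counts() output above
lang_counts = Counter({
    "en": 2532, "es": 512, "ja": 303, "pt": 302, "ko": 28, "id": 9,
    "fr": 5, "ru": 2, "de": 1, "zh-cn": 1, "nl": 1,
})

total = sum(lang_counts.values())
shares = {lang: count / total for lang, count in lang_counts.items()}

# Print the four most common user languages as percentages
for lang, count in lang_counts.most_common(4):
    print(f"{lang}: {shares[lang]:.1%}")
```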

Next, I tried to restrict the language to English, or "en"...

In [13]:
df[df['user.lang'] == 'en']
len(df[df['user.lang'] == 'en'])

Unnamed: 0,text,user.lang,created_at
0,"@firawidya iya fir,hihi. Rencananya awal bulan...",en,Wed Sep 28 01:48:17 +0000 2011
2,"RT @TheNoteboook: If he loves you, he'll care ...",en,Wed Sep 28 01:48:17 +0000 2011
3,Ombbb @ddlavato singing on abc!,en,Wed Sep 28 01:48:17 +0000 2011
6,Não reclamo da vida. É bem provável q ela me p...,en,Wed Sep 28 01:48:17 +0000 2011
7,No. Don't tell me my flaws. I clearly know the...,en,Wed Sep 28 01:48:18 +0000 2011
...,...,...,...
3689,Gimana cara megangnya? :D RT @bangsaaat: bener...,en,Wed Sep 28 02:38:59 +0000 2011
3690,"@SmokedOutEricaa awww, Im sorry to hear that b...",en,Wed Sep 28 02:38:59 +0000 2011
3692,"@hay_ross ahaha, so am I, but we have our reas...",en,Wed Sep 28 02:38:59 +0000 2011
3694,«@Socialite_Trina ♪ Rolex/ More sex/good weed/...,en,Wed Sep 28 02:38:59 +0000 2011


2532

However, as you can see, there are still some tweets in languages other than English (specifically entries 0 and 6). This is because the `user.lang` field refers to the *user's* language setting, not the language of the tweet.

That leads me to...

### Future Plans

I plan to use a language identification library to identify the language of each tweet and filter out the non-English ones that way.
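
Until a library is chosen, the shape of that filter can be sketched with a crude stand-in. This is not a real language identifier: it only checks what fraction of a tweet's words appear in a tiny, hypothetical English stopword list, and both the list and the 0.2 threshold are my assumptions. The sample texts are taken from the tables above.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real language
# identification library would replace this whole function.
ENGLISH_STOPWORDS = {
    "the", "a", "an", "and", "is", "are", "to", "of", "in", "i", "you", "he",
    "it", "my", "that", "so", "but", "we", "have", "on", "for", "not", "me", "this",
}


def looks_english(text, threshold=0.2):
    """Return True when at least `threshold` of the words are English stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    hits = sum(word in ENGLISH_STOPWORDS for word in words)
    return hits / len(words) >= threshold


samples = [
    "@hay_ross ahaha, so am I, but we have our reas...",
    "voo sair aki sem maldade, to com mt sono ; #fato",
    "Não reclamo da vida. É bem provável q ela me p...",
]
flags = [looks_english(s) for s in samples]
for text, flag in zip(samples, flags):
    print(flag, text)
```

Only the first sample passes the check. A heuristic like this will misjudge short or slang-heavy tweets, which is the reason to prefer a proper library for the real analysis.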

Additionally, I want to convert the time data into something simpler, such as just the year.
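
A minimal sketch of that conversion, using only the standard library and assuming every `created_at` value follows the format seen in the tables above. The constant and function names are my own.

```python
from datetime import datetime

# Format of the created_at strings shown above,
# e.g. "Wed Sep 28 01:48:17 +0000 2011"
TWITTER_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweet_year(created_at):
    """Parse a created_at string and return only its year."""
    return datetime.strptime(created_at, TWITTER_DATE_FORMAT).year


year = tweet_year("Wed Sep 28 01:48:17 +0000 2011")
print(year)
```

On the DataFrame, something like `df['created_at'].map(tweet_year)` should produce a year column; `pd.to_datetime` with the same format string is another option to try.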