# Twitter Munging & csv Conversion Program

### Dependencies (refer to the next cell)

### 1. Load, Merge, Sentiment Analysis, Save
 - Load JSON files to be processed (they will appear as lists of dictionaries)
 - Merge lists of tweets into single `mergedlist`
 - Loop through the list, performing sentiment analysis on `tweet['text']`
 - Local pprint a single unedited dictionary for proofing purposes
 - Save the `mergedlist` into a json file

### 2. Munge into Custom DataFrame -> csv
 - Loop through `mergedlist` to extract specific json elements to be included in final DataFrame
 - An explicit loop is not the most compact approach, but the many `try/except` blocks needed for optional fields were difficult to fit into a list comprehension.
 - Local print the top 3 rows of the `mungedDF` for inspection
 - Save `mungedDF` as a csv
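
The `try/except` blocks can also be folded into a small helper, which makes a list comprehension workable. The sketch below is one possible approach, not the method used in this notebook; `dig` and the two sample tweets are hypothetical and only there for illustration.

```python
def dig(obj, *path, default='null'):
    """Follow the keys/indexes in `path`; return `default` if any step is missing."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


# Two made-up tweets: the second has no hashtags and is not a retweet.
sample_tweets = [
    {'text': 'first',
     'entities': {'hashtags': [{'text': 'WomensWave'}]},
     'retweeted_status': {'favorite_count': 12}},
    {'text': 'second',
     'entities': {'hashtags': []}},
]

# One dictionary per tweet, built without any inline try/except.
rows = [
    {'Tweet Text': dig(t, 'text'),
     'Hashtag-1': dig(t, 'entities', 'hashtags', 0, 'text'),
     'Favorite Ct': dig(t, 'retweeted_status', 'favorite_count')}
    for t in sample_tweets
]
print(rows)
```

Missing fields come back as `'null'`, matching the placeholder used in Section 2.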
 
### 3. Partially Flattened JSON -> pd.DF -> csv
 - This process creates a large csv file.
 - Because none of the data is removed going from JSON -> csv, the final csv has numerous columns that still contain nested data, making it difficult to use.
 
### 4. Flattened JSON -> pd.DF -> csv
  - This process creates an even larger csv file!
  - This process attempts to flatten every nested key-value within the json file.
  - Flattened json tweets can generate more than 360 columns.

### Important Variables Used Throughout:
| Variable Name | Description |
|---------------|:-----------:|
| mergedlist    | Merged list of JSON Tweets |
| LongList      | Temp. list of dict before mungedDF |
| mungedDF      | DataFrame created in Section 2 |
| Part_Flat_DF  | DataFrame created in Section 3 |
| Flat_DF       | DataFrame created in Section 4 |

### Legend of other tweet variables
 - This entire section is commented out and not intended to be run.
 - The purpose of this section is to provide the syntax for extracting different variables from full tweet returns so you can customize data munging.

In [1]:
# Dependencies

import json
import numpy as np
import pandas as pd
from pandas import DataFrame as df
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()
from pprint import pprint
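try:
    from pandas import json_normalize  # pandas 1.0 and newer
except ImportError:
    from pandas.io.json import json_normalize  # older pandas releases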


# 1. Load and Merge JSON Files



---------------------------------------------------------------------------------------------------------------------
## Sections:

#### 1.1 (Load & Merge)
 - Load json files (lists of nested dictionaries)
 - Print length of each json file loaded (how many dictionaries in each list)
 - Merge each of the lists into `mergedlist`
 - Print length of `mergedlist`
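
If the number of files grows, the load-and-merge steps above could be written as a loop. This is a minimal sketch under stated assumptions: `load_and_merge` is a hypothetical helper, and the demo writes throwaway files to a temporary folder instead of reading the real `WomensWave/` files.

```python
import json
import tempfile
from pathlib import Path


def load_and_merge(paths):
    """Load each json file (a list of tweet dicts) and merge them into one list."""
    merged = []
    for path in paths:
        with open(path, encoding='utf-8') as fh:
            tweets = json.load(fh)
        print(f'{Path(path).name} has {len(tweets)} items')
        merged.extend(tweets)
    return merged


# Demo with two throwaway files holding 2 and 3 made-up items.
with tempfile.TemporaryDirectory() as tmp:
    for i, n in enumerate([2, 3], start=1):
        items = [{'id': k} for k in range(n)]
        Path(tmp, f'demo_{i}.json').write_text(json.dumps(items), encoding='utf-8')
    demo_merged = load_and_merge(sorted(Path(tmp).glob('demo_*.json')))

print(f'mergedlist has {len(demo_merged)} items')
```

With the real data, the list of paths would be the `WomensWave/...json` file names used in cell 1.1.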

#### 1.2 (Sentiment Analysis)
 - Analyze each tweet's text (the property `tweet['text']`)
 - Add to each tweet response a negative, neutral, positive, and compound sentiment analysis score
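
The pattern of attaching scores to each tweet can be tried without installing VADER. In the sketch below, `fake_polarity_scores` is a hypothetical stand-in that only mimics the shape of the dictionary returned by `analyzer.polarity_scores` (keys `neg`, `neu`, `pos`, `compound`); it does not perform any real analysis.

```python
def fake_polarity_scores(text):
    """Hypothetical stand-in: returns a dict shaped like VADER's polarity scores."""
    return {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


def add_sentiment(tweets, score=fake_polarity_scores):
    """Add the four sentiment fields to every tweet dictionary in place."""
    for tweet in tweets:
        result = score(tweet['text'])
        tweet.update(sen_negative=result['neg'],
                     sen_neutral=result['neu'],
                     sen_positive=result['pos'],
                     sen_compound=result['compound'])
    return tweets


demo = add_sentiment([{'text': 'hello'}])
print(sorted(demo[0]))
```

In the notebook itself, `analyzer.polarity_scores` would be passed as `score`.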

#### 1.3 (Proof Single Tweet)
 - Print a single unedited json return

#### 1.4 (Save)
 - Save `mergedlist` as a new json file
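
A `with` block closes the file automatically once the write finishes, which avoids relying on a manual close call. The sketch below shows one way to save and then reload a list to confirm the write; `save_json` and the temporary path are hypothetical and used only for this demo.

```python
import json
import os
import tempfile


def save_json(records, path):
    """Write a list of dictionaries to `path`; the with-block closes the file."""
    with open(path, 'w', encoding='utf-8') as fh:
        json.dump(records, fh)


# Demo: write to a throwaway location, then read the file back to check it.
tmp_path = os.path.join(tempfile.mkdtemp(), 'demo_merged.json')
save_json([{'id': 1, 'text': 'hi'}], tmp_path)

with open(tmp_path, encoding='utf-8') as fh:
    reloaded = json.load(fh)

print(f'Reloaded {len(reloaded)} item(s)')
```

Reloading the file is a stronger check than printing a success message after the write.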


In [3]:
# 1.1
# Local Folder Paths
# WomensWave/<file name>
# WhyIMarch/<file name>
# WomensMarch_tag <file name>

# Load json file - use `with open ('file name.json') as <temp variable, can be anything>`
## `<variable, can be anything> = json.load(temp variable from previous line)`

with open('WomensWave/WomensWave12519_1.json') as c:
    data1 = json.load(c)
    
with open('WomensWave/WomensWave12519_2.json') as d:
    data2 = json.load(d)
    
with open('WomensWave/WomensWave12519_3.json') as e:
    data3 = json.load(e)    
    
with open('WomensWave/WomensWave12519_5.json') as f:
    data4 = json.load(f)
    
with open('WomensWave/WomensWave12519_6.json') as g:
    data5 = json.load(g)
    
with open('WomensWave/WomensWave12519_7.json') as h:
    data6 = json.load(h)
    
with open('WomensWave/WomensWave12519_8.json') as j:
    data7 = json.load(j)

# Print how many items (tweets) each loaded list has


print(f'Data1 has {len(data1)} items')
print(f'Data2 has {len(data2)} items')
print(f'Data3 has {len(data3)} items')
print(f'Data4 has {len(data4)} items')
print(f'Data5 has {len(data5)} items')
print(f'Data6 has {len(data6)} items')
print(f'Data7 has {len(data7)} items')

## Merge all of the loaded lists of data

mergedlist = data1 + data2 + data3 + data4 + data5 + data6 + data7
print('--------------------------------------')

## Print how many items are in the final merged list

print(f'mergedlist has {len(mergedlist)} items')

Data1 has 675 items
--------------------------------------
mergedlist has 675 items


In [4]:
# 1.2
## Loop through each tweet in the mergedlist
## Run `analyzer.polarity_scores` to generate a sentiment analysis result for each tweet's text
## Add the neg, neu, pos, and compound sentiment analysis results to each tweet

for tweet in mergedlist:
    result = analyzer.polarity_scores(tweet['text'])  # dict with neg, neu, pos, compound
    
    tweet.update(sen_negative = result['neg'])
    tweet.update(sen_neutral = result['neu'])
    tweet.update(sen_positive = result['pos'])
    tweet.update(sen_compound = result['compound'])
    
print('sentiment analysis has been added to each tweet')
    

sentiment analysis has been added to each tweet


In [None]:
# 1.3
## Print single complete json return to understand what is contained in each item
## Caution - this prints off a long list of items!

pprint(mergedlist[0])

In [5]:
# 1.4 (Save)
## Save mergedlist as a new json file
## json978 is just a random variable name - it can be named anything...

json978 = json.dumps(mergedlist)
with open("WhyIMarchMergedList_2019.json", "w") as f:  # the with-block closes the file
    f.write(json978)
print("The json file was probably saved successfully")

The json file was probably saved successfully


# 2. Munge into Custom DataFrame -> csv



---------------------------------------------------------------------------------------------------------------------
## Sections:

#### 2.1 (Convert to Custom DF)
 - 34 of the tweet's dictionary key/values will be identified, temporarily created in a new `LongList`, then converted into `mungedDF` DataFrame
 - You can easily comment out key/values below or add additional key/values to be included in the `LongList` and `mungedDF`
 - For a longer list of mapped out variables review the list on the bottom of this notebook.
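
The four hashtag columns and four mention columns follow the same pattern: take up to four entries and fill the rest with `'null'`. A padding helper is one hedged alternative to repeating `try/except` for each slot; `padded_values` and the sample `entities` dictionary below are hypothetical.

```python
def padded_values(items, key, n, default='null'):
    """Return the first n values of `key` from a list of dicts, padded with `default`."""
    values = [item.get(key, default) for item in items[:n]]
    return values + [default] * (n - len(values))


# Made-up entities block with two hashtags and one mention.
entities = {
    'hashtags': [{'text': 'WhyIMarch'}, {'text': 'WomensWave'}],
    'user_mentions': [{'screen_name': 'example_user', 'name': 'Example User'}],
}

row = {}
for i, tag in enumerate(padded_values(entities['hashtags'], 'text', 4), start=1):
    row[f'Hashtag-{i}'] = tag
for i, sn in enumerate(padded_values(entities['user_mentions'], 'screen_name', 4), start=1):
    row[f'Mentioned-SN-{i}'] = sn

print(row)
```

Changing `n` would add or remove columns without touching the rest of the loop.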

#### 2.2 (Preview DataFrame)
- Print the top 3 rows of the DataFrame for inspection

#### 2.3 (Save)
- Save `mungedDF` as a csv file
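
Whether emoji survive the trip through a csv file can be checked with a quick round trip. The sketch below uses the standard-library `csv` module rather than `DataFrame.to_csv`, so it illustrates the UTF-8 behavior only; the sample row and the temporary path are made up.

```python
import csv
import os
import tempfile

# One made-up row containing an emoji character.
rows = [{'Tweet Text': 'Marching today \U0001F4AA', 'Re-Tweet Ct': 3}]
csv_path = os.path.join(tempfile.mkdtemp(), 'demo_munged.csv')

# Write with UTF-8 so the emoji is stored as-is.
with open(csv_path, 'w', encoding='utf-8', newline='') as fh:
    writer = csv.DictWriter(fh, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)

# Read back with the same encoding and compare the text.
with open(csv_path, encoding='utf-8', newline='') as fh:
    round_trip = list(csv.DictReader(fh))

print(round_trip[0]['Tweet Text'] == rows[0]['Tweet Text'])
```

Note that values read back by `csv.DictReader` are strings, so numeric columns need converting again.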



In [6]:
# 2.1 (Convert to Custom DF)
#########################
# This section is extracting specific tweet variables to be saved in a csv file.
# To add additional tweet variables, refer to the last section in the notebook for additional options.
# Keep in mind, if any new variables are to be extracted and it is not a required field by Twitter, you need to use
# a try: except: approach or the code will have a key error and fail.
#########################
LongList = []

list_num = 0

#### Begin Loop through Tweets

for x in mergedlist:
    print(f"Tweet Number:   {list_num}")
##########################################################################################################
    #### Created At:
    created_at = x['created_at']
##########################################################################################################
    #### Tweet Text
    Tweet_Text = x['text']
##########################################################################################################
    #### Read the VADER scores added to each tweet in Section 1.2
    compound = x['sen_compound']
    pos = x["sen_positive"]
    neu = x["sen_neutral"]
    neg = x["sen_negative"]
##########################################################################################################
    #### ReTweet Count
    try:
        tweet_reTweet = x['retweet_count']
    except:
        tweet_reTweet = 'null'
##########################################################################################################
    #### Favorite Count
    try:
        favorite_count = x['retweeted_status']['favorite_count']
    except (IndexError, KeyError):
        favorite_count = 'null'
##########################################################################################################
    #### Hashtags
          
    # Hashtag1
    try: 
        Hashtag1 = x['entities']['hashtags'][0]['text']
    except IndexError:
        Hashtag1 = 'null'
          
    # Hashtag2
    try: 
        Hashtag2 = x['entities']['hashtags'][1]['text']
    except IndexError:
        Hashtag2 = 'null'
          
    # Hashtag3
    try: 
        Hashtag3 = x['entities']['hashtags'][2]['text']
    except IndexError:
        Hashtag3 = 'null'
        
    # Hashtag4
    try: 
        Hashtag4 = x['entities']['hashtags'][3]['text']
    except IndexError:
        Hashtag4 = 'null'
    
##########################################################################################################
    ##### Gathering Mentioned Screen Name and Names
    # Mentioned Entry 1
    try: 
        screenname1 = x['entities']['user_mentions'][0]['screen_name']
    except IndexError:
        screenname1 = 'null'
    try: 
        name1 = x['entities']['user_mentions'][0]['name']
    except IndexError:
        name1 = 'null'

    # Mentioned Entry 2
    try: 
        screenname2 = x['entities']['user_mentions'][1]['screen_name']
    except IndexError:
        screenname2 = 'null'

    try: 
        name2 = x['entities']['user_mentions'][1]['name']
    except IndexError:
        name2 = 'null'

    # Mentioned Entry 3
    try: 
        screenname3 = x['entities']['user_mentions'][2]['screen_name']
    except IndexError:
        screenname3 = 'null'
    try: 
        name3 = x['entities']['user_mentions'][2]['name']
    except IndexError:
        name3 = 'null'

    # Mentioned Entry 4
    try: 
        screenname4 = x['entities']['user_mentions'][3]['screen_name']
    except IndexError:
        screenname4 = 'null'
    try: 
        name4 = x['entities']['user_mentions'][3]['name']
    except IndexError:
        name4 = 'null'
##########################################################################################################
    #### Begin User Profile Section
##########################################################################################################
    #### Account name
    User_Name = x['user']['name']
##########################################################################################################
    #### Screen Name
    Screen_Name = x['user']['screen_name']
##########################################################################################################
    #### User Description
    User_Description = x['user']['description']
##########################################################################################################
    #### User location
    try:
        User_Location = x['user']['location']
    except:
        User_Location = 'null'
##########################################################################################################
    #### User Followers Count (how many accounts are following the user)
    User_FollowersCt = x['user']['followers_count']
##########################################################################################################
    #### User Friends Count (how many accounts the user is following)
    User_FriendsCt = x['user']['friends_count']
##########################################################################################################
    #### User Verified
    User_Verified = x['user']['verified']
##########################################################################################################
    #### User_Geo
    try:
        User_Geo = x['geo']
    except:
        User_Geo = 'null'
########################################################################################################## 
    #### User_Place
    try:
        User_Place = x['place']
    except:
        User_Place = 'null'
##########################################################################################################
    #### Tweet_ID
    Tweet_ID = x['id'] 
##########################################################################################################
    #### Tweet_ID_Str
    Tweet_ID_str = x['id_str']
##########################################################################################################
    #### ReTweet_ID
    try:
        ReTweet_ID = x['retweeted_status']['id']
    except:
        ReTweet_ID = 'null'
##########################################################################################################
    #### ReTweet_ID_Str
    try:
        ReTweet_ID_str = x['retweeted_status']['id_str']
    except:
        ReTweet_ID_str = 'null'
##########################################################################################################
    #### Coordinates
    try:
        coordinates = x['coordinates']
    except:
        coordinates = 'null'
##########################################################################################################
    ## Create Dictionary Entry to List
    
    LongList.append({'Created At' : created_at,
                     'Tweet Text' : Tweet_Text,
                     'Sen-Compound' : compound,
                     'Sen-Positive' : pos,
                     'Sen-Negative' : neg,
                     'Sen-Neutral' : neu,
                     'Re-Tweet Ct' : tweet_reTweet,
                     'Favorite Ct' : favorite_count,
                     'Hashtag-1' : Hashtag1,
                     'Hashtag-2' : Hashtag2,
                     'Hashtag-3' : Hashtag3,
                     'Hashtag-4' : Hashtag4,
                     'Mentioned-SN-1' : screenname1,
                     'Mentioned-Name-1' : name1,
                     'Mentioned-SN-2' : screenname2,
                     'Mentioned-Name-2' : name2,
                     'Mentioned-SN-3' : screenname3,
                     'Mentioned-Name-3' : name3,
                     'Mentioned-SN-4' : screenname4,
                     'Mentioned-Name-4' : name4,
                     'Tweet Acct Name' : User_Name,
                     'Screen Name' : Screen_Name,
                     'User Description' : User_Description,
                     'User Location' : User_Location,
                     'User Follower Ct' : User_FollowersCt,
                     'User Friends Ct' : User_FriendsCt,
                     'User Verified' : User_Verified,
                     'User Geo Loc' : User_Geo,
                     'User Place' : User_Place,
                     'Tweet ID' : Tweet_ID,
                     'Tweet ID str' : Tweet_ID_str,
                     'Re-Tweet ID' : ReTweet_ID,
                     'Re-Tweet ID str' : ReTweet_ID_str,
                     'Coordinates' : coordinates
                    })
   

    ## go to the next item
    
    list_num = list_num + 1
    
print('munging complete')

## Convert to DataFrame

mungedDF = pd.DataFrame(LongList)

print('DataFrame did not crash during processing')


Tweet Number:   0
Tweet Number:   1
Tweet Number:   2
Tweet Number:   3
Tweet Number:   4
Tweet Number:   5
Tweet Number:   6
Tweet Number:   7
Tweet Number:   8
Tweet Number:   9
Tweet Number:   10
Tweet Number:   11
Tweet Number:   12
Tweet Number:   13
Tweet Number:   14
Tweet Number:   15
Tweet Number:   16
Tweet Number:   17
Tweet Number:   18
Tweet Number:   19
Tweet Number:   20
Tweet Number:   21
Tweet Number:   22
Tweet Number:   23
Tweet Number:   24
Tweet Number:   25
Tweet Number:   26
Tweet Number:   27
Tweet Number:   28
Tweet Number:   29
Tweet Number:   30
Tweet Number:   31
Tweet Number:   32
Tweet Number:   33
Tweet Number:   34
Tweet Number:   35
Tweet Number:   36
Tweet Number:   37
Tweet Number:   38
Tweet Number:   39
Tweet Number:   40
Tweet Number:   41
Tweet Number:   42
Tweet Number:   43
Tweet Number:   44
Tweet Number:   45
Tweet Number:   46
Tweet Number:   47
Tweet Number:   48
Tweet Number:   49
Tweet Number:   50
Tweet Number:   51
Tweet Number:   52
Twe

In [None]:
# 2.2 (Preview DataFrame)

# Print the top 3 rows of the `mungedDF`

mungedDF.head(3)

In [7]:
# 2.3 (Save)
# Save DataFrame as csv file
# Enter in csv file name in ()
# File will be stored in the same folder as this jupyter notebook unless you
#    specify a new path in the csv file name
# Using encoding utf-8 to preserve emoji characters
# index = False so the DataFrame index is not written as a column in the csv
# IMPORTANT: emoji characters may not display correctly when the csv is opened in some spreadsheet programs.
#            If the csv file is read back into a pd.DataFrame the emojis will print out as expected.

mungedDF.to_csv('WhyIMarch_MungedData1.csv', encoding = 'utf-8', index = False)
print('The csv file was probably saved successfully')

# Save `LongList` as a new json file
# json123 is just a random variable name - it can be named anything...

json123 = json.dumps(LongList)
with open("WhyIMarchLongList_2019.json", "w") as f:  # the with-block closes the file
    f.write(json123)
print("The json file was probably saved successfully")

The csv file was probably saved successfully
The json file was probably saved successfully


# 3. Partially Flattened JSON -> pd.DF -> csv

---------------------------------------------------------------------------------------------------------------------
## Sections:

#### 3.1 (`mergedlist` -> DataFrame)
 - Use `.from_records(mergedlist)` as a quick method to partially flatten the json file
 - Creates approximately 35 columns, many of which still hold nested data.

#### 3.2 (list of columns)
 - Print a list of columns in `Part_Flat_DF`
 
#### 3.3 (save)
 - Save `Part_Flat_DF` to a csv file

In [None]:
# 3.1 (`mergedlist` -> DataFrame)
# Convert the merged list of json tweets into a pandas DataFrame
# Display the first 3 rows of data to inspect.

Part_Flat_DF = pd.DataFrame.from_records(mergedlist)
Part_Flat_DF.head(3)

In [None]:
# 3.2 (List of Column names)
# Print list of DataFrame Columns

list(Part_Flat_DF)

In [None]:
# 3.3 (Save)
# Save DataFrame as csv file
# Enter in csv file name in ()
# File will be stored in the same folder as this jupyter notebook unless you
#    specify a new path in the csv file name
# Using encoding utf-8 to preserve emoji characters
# index = False so the DataFrame index is not written as a column in the csv
# IMPORTANT: emoji characters may not display correctly when the csv is opened in some spreadsheet programs.
#            If the csv file is read back into a pd.DataFrame the emojis will print out as expected.

Part_Flat_DF.to_csv('WomensMarch_PartialFlattened_json.csv', encoding = 'utf-8', index = False)
print("The csv was successfully saved")

# 4. Flattened JSON -> pd.DF -> csv

---------------------------------------------------------------------------------------------------------------------
## Sections:

#### 4.1 (`mergedlist` -> DataFrame)
 - Use `json_normalize(mergedlist)` as a quick method to flatten MOST of the nested dictionaries in the json file
 - Creates approximately 361 columns.
 - Why so many columns?! This method creates a column for every nested key that appears anywhere in `mergedlist`, even if only one tweet uses it. For example, if a single tweet is a retweet, columns such as `retweeted_status.id` are created for the whole DataFrame. That tweet will have values in those columns and the rest of the rows will display NaN. Lists (such as the list of hashtags) are not expanded; each stays as a list inside a single column.
 - This format might be helpful if you want every nested dictionary flattened into a csv, but expect a lot of NaN values throughout.
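
The difference between the Section 3 and Section 4 approaches can be seen on a tiny example. The two tweets below are made up, and the sketch assumes pandas 1.0 or newer, where `json_normalize` is available as `pd.json_normalize`.

```python
import pandas as pd

# Two made-up tweets; only the second one is a retweet.
tweets = [
    {'id': 1, 'text': 'a',
     'user': {'name': 'A', 'verified': False},
     'entities': {'hashtags': [{'text': 'x'}]}},
    {'id': 2, 'text': 'b',
     'user': {'name': 'B', 'verified': True},
     'entities': {'hashtags': []},
     'retweeted_status': {'id': 99, 'user': {'name': 'C'}}},
]

# Section 3 style: one column per top-level key, nested dicts stay in the cells.
part_flat = pd.DataFrame.from_records(tweets)

# Section 4 style: nested dicts become dotted column names, lists are left as lists.
flat = pd.json_normalize(tweets)

print(list(part_flat.columns))
print(list(flat.columns))
print(flat['retweeted_status.id'].isna().tolist())
```

The first tweet is not a retweet, so its `retweeted_status.*` cells are NaN, which is the same effect that produces so many sparse columns on the full data.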

#### 4.2 (list of columns)
 - Print a list of columns in `Flat_DF`
 
#### 4.3 (save)
 - Save `Flat_DF` to a csv file

In [None]:
# 4.1 (Convert json to DataFrame)

# Use `json_normalize` to flatten MOST of the nested dictionaries into columns.

Flat_DF = json_normalize(mergedlist)
Flat_DF.head(3)

In [None]:
# 4.2 (List of columns)
# Caution - this prints out a long list (approximately 361 column names)

list(Flat_DF)

In [None]:
# 4.3 (Save)
# Save DataFrame as csv file
# Enter in csv file name in ()
# File will be stored in the same folder as this jupyter notebook unless you
#    specify a new path in the csv file name

Flat_DF.to_csv('WomensMarch_Flattened_json.csv')
print('The csv was saved successfully')

In [None]:
##########################################################################################################
#Dependencies Plotting with Matplotlib
##########################################################################################################

from datetime import datetime
import matplotlib.pyplot as plt
from matplotlib import style
from matplotlib.pyplot import figure
style.use('ggplot')


In [None]:
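# Create plot
# NOTE: `sentiments_pd` and `search_term` are not defined earlier in this notebook.
# The stand-ins below are assumptions built from `mergedlist`; adjust them as needed.
sentiments_pd = pd.DataFrame({"Tweets Ago": range(len(mergedlist)),
                              "compound": [t["sen_compound"] for t in mergedlist]})
search_term = "#WomensWave"  # assumed label for the plot title
fig = plt.figure(figsize=(16,10))
x_vals = sentiments_pd["Tweets Ago"]
y_vals = sentiments_pd["compound"]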
plt.scatter(x_vals,
         y_vals,
         marker="o",
         linewidth=0.5,
         alpha=0.8        )

# Incorporate the other graph properties
now = datetime.now()
now = now.strftime("%Y-%m-%d %H:%M")
plt.title(f"Sentiment Analysis of Tweets ({now}) for {search_term}")
plt.xlim([x_vals.max(),x_vals.min()]) #Bonus
plt.ylabel("Tweet Polarity")
plt.xlabel("Tweets Ago")



plt.show()

In [None]:
###########################
#### Save Figure
###########################

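# Save through the `fig` object created above; calling plt.savefig after plt.show() can write a blank image
fig.savefig('test.png', bbox_inches = 'tight')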

In [None]:
#################################################################################
#################################################################################
####### NOT for running - these are notes and the legend
#################################################################################

#     tweet['created_at']                               # Time Stamp of when tweet was created
#     tweet['id']                                       # tweet id Object (number)
#     tweet['id_str']                                   # tweet id String format (number)
#     tweet['text']                                     # text of tweet
## Hashtags
#     tweet['entities']['hashtags'][0]['text']          # 1st hashtag used
#     tweet['entities']['hashtags'][1]['text']          # 2nd hashtag used (if more than one were used)
#     tweet['entities']['hashtags'][2]['text']          # 3rd hashtag used (if more than 2 were used)
#     tweet['entities']['hashtags'][3]['text']          # 4th hashtag used (if more than 3 were used)
## Mentions
#     tweet['entities']['user_mentions'][0]['screen_name'] # screen name of 1st person mentioned
#     tweet['entities']['user_mentions'][0]['name']        # name of person 1st mentioned
#     tweet['entities']['user_mentions'][1]['screen_name'] # screen name of 2nd person mentioned
#     tweet['entities']['user_mentions'][1]['name']        # name of person 2nd mentioned
#     tweet['entities']['user_mentions'][2]['screen_name'] # screen name of 3rd person mentioned
#     tweet['entities']['user_mentions'][2]['name']        # name of person 3rd mentioned
#     tweet['entities']['user_mentions'][3]['screen_name'] # screen name of 4th person mentioned
#     tweet['entities']['user_mentions'][3]['name']        # name of person 4th mentioned
## User Info
#     tweet['user']['id']                               # id (object) of account user
#     tweet['user']['name']                             # name of account user
#     tweet['user']['screen_name']                      # Screen name of person
#     tweet['user']['location']                         # string, user input of their location
#     tweet['user']['description']                      # description of the account user
#     tweet['user']['followers_count']                  # number of accounts following the user
#     tweet['user']['friends_count']                    # number of accounts the user is following
#     tweet['user']['verified']                         # is the account user 'verified'
#     tweet['geo']                                      # is geo null or on
#     tweet['coordinates']                              # coordinates or null
#     tweet['place']                                    # tweet place description or null
#     tweet['retweeted_status']['id']                   # Original tweet id number object
#     tweet['retweeted_status']['id_str']               # Original tweet id number string
#     tweet['retweet_count']                            # number of times an original tweet has been retweeted
#     tweet['retweeted_status']['favorite_count']       # number of times a tweet has been favorited
## Sentiment Analysis
#     tweet['sen_compound']                             # Sentiment Analysis - Compound Score
#     tweet['sen_negative']                             # Sentiment Analysis - Negative Score
#     tweet['sen_neutral']                              # Sentiment Analysis - Neutral Score
#     tweet['sen_positive']                             # Sentiment Analysis - Positive Score
  
