#Read and Clean the tweets database 


I exported the tweets database called "rockets" from MongoDB as a .json file (this can be done directly from the terminal). Here, I read the rockets.json file into a pandas DataFrame. I have 105,783 tweets! Using BeautifulSoup and regular expressions, I remove HTML and all non-letter characters from the text. Using NLTK's stopwords corpus, I remove all stop words. I further clean the text by dropping the URL fragments 'http' and 'https' and the token 'rt' (which marks retweets); the '@' sign is already gone once the non-letter characters are stripped. Finally, I save the clean text as a separate column in the DataFrame and export that column in .csv format for further analysis.

In [4]:
# Import modules 

import pandas as pd
from pandas import DataFrame, Series
import json
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords

###Load data

In [5]:
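# Read the JSON file (one tweet per line) and put it in a DataFrame

path = 'rockets.json'
# Open the file in a with-block so it is closed once all lines are parsed
with open(path, encoding='utf-8') as f:
    record = [json.loads(line) for line in f]
rockets = DataFrame(record)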

In [6]:
rockets.shape

(105783, 5)

In [7]:
rockets['text'][0:25]

0     Getting ready for Game 2 Rockets vs. Warriors ...
1     RT @iamC_Mart: If Lil B curse James Harden, it...
2     Tonight's free pick: \n\nHouston Rockets +10.5...
3                                    Let’s go Rockets!!
4                             rockets boutta win game 2
5     RT @ComplexMag: Rockets fans are begging Lil B...
6     RT @lildelvin_: Rockets can't take this L today 😷
7     RT @SportsCenter: Rockets and Warriors meet fo...
8                 Nah cuh chill https://t.co/41qQRLNevT
9     RT @rjthamacrj: The Rockets are about to play ...
10    RT @SportsCenter: Draymond Green enters arena ...
11                                warriors &gt; rockets
12    RT @_RJack1_: About to lock-in on this Warrior...
13    RT @SportsCenter: Rockets and Warriors meet fo...
14    RT @ESPNStatsInfo: Rockets are +12 with Howard...
15    RT @SBNationNBA: Dwight Howard is in for Game ...
16    Houston Rockets vs. Golden State Warriors: Liv...
17    Houston Rockets vs. Golden State Warriors:

###Clean and preprocess data 

Kaggle has an excellent [tutorial](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words) that helped me get started.

In [8]:
# In Python, searching a set is much faster than searching a list,
# so convert the stop words to a set (once, rather than on every call)
stops = set(stopwords.words("english"))

def tweet_to_words(raw_text):
    # Function to convert a raw tweet to a string of words
    # The input is a single string (a raw tweet), and
    # the output is a single string (a preprocessed tweet)

    # 1. Remove HTML (naming the parser explicitly avoids the "no parser specified" warning)
    tweet_text = BeautifulSoup(raw_text, "html.parser").get_text()

    # 2. Replace every run of non-letters with a single space
    letters_only = re.sub("[^a-zA-Z]+", " ", tweet_text)

    # 3. Convert to lower case and split into individual words
    words = letters_only.lower().split()

    # 4. Remove stop words, URL fragments ('http', 'https') and the retweet marker 'rt'.
    #    Step 2 has already stripped '@' and apostrophes, so they need no separate check.
    meaningful_words = [w for w in words
                        if w not in stops
                        and 'http' not in w
                        and w != 'rt']

    # 5. Join the words back into one string separated by spaces and return the result
    return " ".join(meaningful_words)
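The comment about sets being faster to search than lists can be checked with a quick timing. This is only a sketch: the word list below is a made-up stand-in (so it runs without downloading the NLTK corpus), and the absolute timings will vary by machine.

```python
import timeit

# Hypothetical stand-in for the stop-word list (not the real NLTK corpus)
word_list = ["word%d" % i for i in range(200)]
word_set = set(word_list)

# A token that is not a stop word: the list must be scanned to the end,
# while the set answers with a single hash lookup
probe = "rockets"

list_time = timeit.timeit(lambda: probe in word_list, number=20000)
set_time = timeit.timeit(lambda: probe in word_set, number=20000)

print("list lookup: %.4f s" % list_time)
print("set lookup:  %.4f s" % set_time)
```

Both containers give the same answer for every token; only the lookup cost differs, and that cost matters when the check runs for every word of every tweet.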

In [9]:
clean_tweet = tweet_to_words(rockets['text'][0])
print(clean_tweet)

getting ready game rockets vs warriors warriors spashbros nba basketball besureinc co glrexsuchm
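The cleaned tweet still ends with pieces of a shortened link ("co glrexsuchm"), because the non-letter substitution breaks a URL into separate tokens before the 'http' filter sees it. One possible variation, not what I used for the results in this notebook, is to delete whole URLs and @mentions before stripping non-letters. The sketch below uses only the standard library (`html.unescape` stands in for BeautifulSoup here), leaves stop-word removal out, and the function name `strip_urls_and_mentions` is made up for illustration.

```python
import html
import re

URL_RE = re.compile(r'https?://\S+')   # a whole link, up to the next whitespace
MENTION_RE = re.compile(r'@\w+')       # an @handle, including underscores

def strip_urls_and_mentions(raw_text):
    """Drop links and @mentions, then keep lower-cased letter-only words."""
    text = html.unescape(raw_text)           # decode entities such as &gt;
    text = URL_RE.sub(' ', text)             # remove complete URLs first
    text = MENTION_RE.sub(' ', text)         # remove complete @handles
    text = re.sub(r'[^a-zA-Z]+', ' ', text)  # then strip the remaining non-letters
    return ' '.join(text.lower().split())

print(strip_urls_and_mentions('Nah cuh chill https://t.co/41qQRLNevT'))
print(strip_urls_and_mentions('warriors &gt; rockets'))
```

With this ordering no link fragments survive, and handles no longer leak into the text as words (compare 'iamc mart' in the cleaned retweet).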


In [10]:
# Call the cleaning function on each tweet and collect the results in a list
clean_tweets = [tweet_to_words(tweet) for tweet in rockets['text']]

  '"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
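This message is a warning rather than an error: BeautifulSoup prints it when a tweet looks like a bare URL, and then still parses the string. If the noise is unwanted, the standard `warnings` module can filter it. The sketch below is an assumption-laden example: the helper name `call_without_url_warning` is made up, the pattern is copied from the message text shown above (other BeautifulSoup versions may word it differently), and `noisy_parse` only imitates the warning so the example runs without bs4.

```python
import warnings

def call_without_url_warning(func, *args, **kwargs):
    """Run func while ignoring warnings whose text contains 'looks like a URL'."""
    with warnings.catch_warnings():
        # The message pattern is matched against the start of the warning text
        warnings.filterwarnings("ignore",
                                message=r".*looks like a URL",
                                category=UserWarning)
        return func(*args, **kwargs)

def noisy_parse(markup):
    # Stand-in for a parser call that warns on URL-like input (not the real bs4 code)
    warnings.warn('"%s" looks like a URL.' % markup, UserWarning)
    return markup.lower()

print(call_without_url_warning(noisy_parse, "https://t.co/41qQRLNevT"))
```

In the notebook, the same helper could wrap each `tweet_to_words` call; the filter is restored to its previous state as soon as the call returns.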

In [11]:
clean_tweets[0:5]

['getting ready game rockets vs warriors warriors spashbros nba basketball besureinc co glrexsuchm',
 'iamc mart lil b curse james harden curtains rockets',
 'tonight free pick houston rockets u',
 'let go rockets',
 'rockets boutta win game']

In [12]:
rockets['CleanText'] = clean_tweets

In [13]:
rockets.head()

| | _id | created_at | geo | source | text | CleanText |
|---|---|---|---|---|---|---|
| 0 | {'$oid': '555e7c38311e9d08bb45c086'} | {'$date': 1432255544000} | | Instagram | Getting ready for Game 2 Rockets vs. Warriors ... | getting ready game rockets vs warriors warrior... |
| 1 | {'$oid': '555e7c3b311e9d08bb45c087'} | {'$date': 1432255545000} | | Twitter for Android | RT @iamC_Mart: If Lil B curse James Harden, it... | iamc mart lil b curse james harden curtains ro... |
| 2 | {'$oid': '555e7c3b311e9d08bb45c088'} | {'$date': 1432255546000} | | Twitter for iPhone | Tonight's free pick: \n\nHouston Rockets +10.5... | tonight free pick houston rockets u |
| 3 | {'$oid': '555e7c3b311e9d08bb45c089'} | {'$date': 1432255546000} | | Tweetbot for iΟS | Let’s go Rockets!! | let go rockets |
| 4 | {'$oid': '555e7c3b311e9d08bb45c08a'} | {'$date': 1432255546000} | | Twitter for iPhone | rockets boutta win game 2 | rockets boutta win game |
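The `_id` and `created_at` columns still hold the wrapper dictionaries from the MongoDB export (`{'$oid': ...}` and `{'$date': ...}`). I drop both columns below, so this is optional, but if the timestamps are needed later they can be unwrapped before building the DataFrame. A minimal sketch, assuming `$date` is always an integer number of milliseconds since the Unix epoch, as in the rows shown here; the helper name `flatten_record` is made up.

```python
from datetime import datetime, timezone

def flatten_record(rec):
    """Return a copy of one exported tweet with plain _id and created_at values."""
    flat = dict(rec)
    # {'$oid': '...'} -> '...'
    if isinstance(flat.get('_id'), dict):
        flat['_id'] = flat['_id'].get('$oid')
    # {'$date': milliseconds since epoch} -> timezone-aware UTC datetime
    if isinstance(flat.get('created_at'), dict):
        millis = flat['created_at'].get('$date')
        flat['created_at'] = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return flat

# Values copied from row 3 of the table above
example = {
    '_id': {'$oid': '555e7c3b311e9d08bb45c089'},
    'created_at': {'$date': 1432255546000},
    'text': 'Let’s go Rockets!!',
}
flat_example = flatten_record(example)
print(flat_example['_id'])
print(flat_example['created_at'].isoformat())
```

Applied to the loaded data, this would look like `DataFrame([flatten_record(r) for r in record])`.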


###Save relevant data

In [14]:
clean_rockets = rockets.drop(['_id','created_at','geo','source','text'], axis=1)

In [15]:
clean_rockets.head()

| | CleanText |
|---|---|
| 0 | getting ready game rockets vs warriors warrior... |
| 1 | iamc mart lil b curse james harden curtains ro... |
| 2 | tonight free pick houston rockets u |
| 3 | let go rockets |
| 4 | rockets boutta win game |


In [16]:
clean_rockets.to_csv('rockets_cleantext.csv',index=False)