# Preprocessing
> Author: [Dawn Graham](https://dawngraham.github.io/)

Quick clean of `monthlytweets.csv`.

Versions used:
- Python 3.6.6
- pandas 0.23.4

### Import libraries

In [1]:
import pandas as pd

### Read in data

In [2]:
tweets = pd.read_csv('../data/raw-data/monthlytweets.csv')

### Initial cleaning

In [3]:
# Drop duplicates
tweets.drop_duplicates(inplace=True)
print(f"Total: {tweets.shape[0]}")
print(f"Unique: {tweets['id'].nunique()}")

Total: 20163
Unique: 20163
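
Matching totals mean that no `id` is left repeated after exact duplicates are removed. `drop_duplicates()` only removes rows that are identical in every column, so a repeated `id` with different text would survive it. The sketch below uses made-up toy rows (not the real data) to show one way to surface such rows if the two counts ever disagree.

```python
import pandas as pd

# Toy data (hypothetical values): rows 0 and 1 are exact duplicates,
# row 2 reuses the same id with different text.
toy = pd.DataFrame({
    'id': ['a1', 'a1', 'a1', 'b2'],
    'text': ['hello', 'hello', 'hello (edited)', 'world'],
})

# Removes only rows that match in every column (row 1 here)
toy = toy.drop_duplicates()

# Rows whose id still appears more than once
repeated_ids = toy[toy.duplicated(subset='id', keep=False)]

print(f"Total: {toy.shape[0]}")
print(f"Unique: {toy['id'].nunique()}")
print(repeated_ids)
```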


In [4]:
# Inspect the largest `timestamp` strings; values that are not dates sort after the dates
sorted(tweets['timestamp'].unique(), reverse=True)[:3]

['timestamp', 'http://bit.ly/yTIrUE\xa0', '2012-12-31 22:22:25']

In [5]:
# Show the row whose `timestamp` holds a URL instead of a date
# (regex=False matches the URL literally, so '.' is not treated as a wildcard)
tweets[tweets['timestamp'].str.contains('http://bit.ly/yTIrUE\xa0', case=False, regex=False)]

| | timestamp | id | text | user | likes | replies | retweets | query |
|---|---|---|---|---|---|---|---|---|
| 7454 | http://bit.ly/yTIrUE | Noesis_Now | 0 | 0 | 0 | EversourceMA OR EversourceNH OR VelcoVT OR nat... | | |
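
Sorting the unique strings works here because the bad values happen to sort after the dates. A more general check, sketched below on toy rows, is to parse the column with `errors='coerce'` and look at whatever fails to parse. The `toy` frame and the `'someone'` user are assumptions for illustration.

```python
import pandas as pd

# Toy data mimicking the three kinds of `timestamp` values seen above
toy = pd.DataFrame({
    'timestamp': ['2012-12-31 22:22:25', 'timestamp', 'http://bit.ly/yTIrUE\xa0'],
    'user': ['someone', 'user', 'Noesis_Now'],
})

# Unparseable strings become NaT instead of raising an error
parsed = pd.to_datetime(toy['timestamp'], errors='coerce')

# Rows whose timestamp could not be parsed as a date
bad_rows = toy[parsed.isnull()]
print(bad_rows)
```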


In [6]:
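# Drop the malformed row shown above.
# `tweets.index[7454]` is the label at position 7454; it equals 7454
# only if no earlier rows were removed by `drop_duplicates()`.
tweets.drop(tweets.index[7454], inplace=True)

# Drop repeated header rows, where `timestamp` holds the literal string 'timestamp'
tweets = tweets[tweets['timestamp'] != 'timestamp']

# Drop rows where `query` is null
tweets = tweets[tweets['query'].notnull()]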
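
One thing to watch in the first step: the label printed next to a row and its position in the frame can drift apart once rows have been dropped. The toy example below (hypothetical values, not the real data) illustrates the difference between dropping by position lookup and dropping by label. It is a sketch of the general pandas behaviour, not a claim about what happened in this dataset.

```python
import pandas as pd

# Default labels are 0..3; the row labelled 2 is the "bad" one
toy = pd.DataFrame({'timestamp': ['t0', 't0', 'bad', 't3']})

# Removes the duplicate at label 1, leaving labels 0, 2, 3
toy = toy.drop_duplicates()

# Position 2 now holds label 3, so this removes 't3' and keeps 'bad'
by_position = toy.drop(toy.index[2])

# Dropping by label removes the row that is displayed as 2
by_label = toy.drop(index=2)

print(by_position['timestamp'].tolist())
print(by_label['timestamp'].tolist())
```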

### Check shape & types

In [7]:
tweets.shape

(20160, 8)

In [8]:
tweets.dtypes

timestamp    object
id           object
text         object
user         object
likes        object
replies      object
retweets     object
query        object
dtype: object
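
Every column is `object` because the stray header and malformed rows forced pandas to read everything as strings. This notebook leaves the columns as they are. If numeric counts are needed later, one possible follow-up is sketched below on toy rows. The `count_cols` name is an assumption, and `id` is left as a string on purpose.

```python
import pandas as pd

# Toy rows with counts stored as strings, as in the dtypes above
toy = pd.DataFrame({
    'id': ['264152432282578945', '264151136792109056'],
    'likes': ['1', '0'],
    'replies': ['1', '0'],
    'retweets': ['3', '0'],
})

# Convert the count columns; anything non-numeric would become NaN
count_cols = ['likes', 'replies', 'retweets']
toy[count_cols] = toy[count_cols].apply(pd.to_numeric, errors='coerce')

print(toy.dtypes)
```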

In [9]:
tweets.isnull().sum()

timestamp    0
id           0
text         0
user         0
likes        0
replies      0
retweets     0
query        0
dtype: int64

### Convert to time series

In [10]:
# Convert `timestamp` to datetime and use it as the index
tweets['timestamp'] = pd.to_datetime(tweets['timestamp'])
tweets.set_index('timestamp', inplace=True)
tweets.head()

| timestamp | id | text | user | likes | replies | retweets | query |
|---|---|---|---|---|---|---|---|
| 2012-11-01 23:50:22 | 264152432282578945 | Tom May, CEO of Northeast Utilities, the paren... | EversourceMA | 1 | 1 | 3 | EversourceMA OR EversourceNH OR VelcoVT OR nat... |
| 2012-11-01 23:45:13 | 264151136792109056 | @NYGovCuomo @lipanews @nationalgridus @nyseand... | readyforthenet | 0 | 0 | 0 | EversourceMA OR EversourceNH OR VelcoVT OR nat... |
| 2012-11-01 23:34:44 | 264148498352590849 | Some amazing video from the Wareham microburst... | EversourceMA | 1 | 0 | 1 | EversourceMA OR EversourceNH OR VelcoVT OR nat... |
| 2012-11-01 23:34:20 | 264148399190851584 | @nationalgridus Call me if you need some help ... | sparky1000 | 0 | 0 | 0 | EversourceMA OR EversourceNH OR VelcoVT OR nat... |
| 2012-11-01 23:31:56 | 264147793147490304 | Current PSNH statewide w/o power: 885. We're d... | EversourceNH | 0 | 1 | 8 | EversourceMA OR EversourceNH OR VelcoVT OR nat... |
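
A datetime index allows date-based selection and resampling. The sketch below uses a few toy rows to show both. The third timestamp is invented for illustration, and the index is sorted first because the rows above are not in ascending order.

```python
import pandas as pd

# Toy rows; the 2012-11-02 timestamp is a made-up example value
toy = pd.DataFrame({
    'timestamp': ['2012-11-01 23:50:22', '2012-11-01 23:45:13', '2012-11-02 08:00:00'],
    'user': ['EversourceMA', 'readyforthenet', 'EversourceNH'],
})
toy['timestamp'] = pd.to_datetime(toy['timestamp'])

# Sort so that date-based slicing behaves predictably
toy = toy.set_index('timestamp').sort_index()

# Select all rows from a single day using a partial date string
on_nov_1 = toy.loc['2012-11-01']

# Count rows per calendar day
per_day = toy.resample('D').size()
print(per_day)
```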


### Export as `monthlytweets_cleaned.csv`

In [11]:
tweets.to_csv('../data/monthlytweets_cleaned.csv', index=True)
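
CSV does not store dtypes, so the datetime index has to be parsed again when the cleaned file is loaded. The sketch below shows one way to do that, reading from an in-memory string rather than the real file so that it runs on its own. The shortened column set is an assumption, and reading `id` as a string is a design choice to keep the long identifiers as text.

```python
import io
import pandas as pd

# Stand-in for the exported file, with a reduced set of columns
csv_text = (
    "timestamp,id,user,likes\n"
    "2012-11-01 23:50:22,264152432282578945,EversourceMA,1\n"
    "2012-11-01 23:45:13,264151136792109056,readyforthenet,0\n"
)

# Restore the datetime index and keep `id` as text
reloaded = pd.read_csv(
    io.StringIO(csv_text),
    index_col='timestamp',
    parse_dates=True,
    dtype={'id': str},
)
print(reloaded.index)
```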