In [142]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


### View The Data
I will start by looking at what the data collection produced: how many rows have location data, and what does that data look like?

In [143]:
mydata = "tweets1016.csv"
df = pd.read_csv(mydata)
df.head(5)

Unnamed: 0.1,Unnamed: 0,t_id,s_name,t_text,u_location,image_url,t_date
0,0.0,1052182972751929344,Imposter_Cat,RT @MidlandsHECF: How can we gather research u...,impostersyndrome.life,http://pbs.twimg.com/profile_images/1047849327...,2018-10-16 13:02:23
1,1.0,1052182953487491079,MidlandsHECF,How can we gather research using pedagogic too...,"The Midlands, UK",http://pbs.twimg.com/profile_images/1003594466...,2018-10-16 13:02:19
2,2.0,1052182382026125313,Imposter_Cat,RT @DigiLeaders: Women in the workplace ‘Loud ...,impostersyndrome.life,http://pbs.twimg.com/profile_images/1047849327...,2018-10-16 13:00:03
3,3.0,1052182375877226497,DigiLeaders,"Women in the workplace ‘Loud and clear, you be...",Global,http://pbs.twimg.com/profile_images/8796744861...,2018-10-16 13:00:01
4,4.0,1052182356197613568,ercarrigan,RT @nelsonlflores: Tips for avoiding imposter ...,,http://pbs.twimg.com/profile_images/1021248115...,2018-10-16 12:59:56


In [144]:
df.dtypes

Unnamed: 0    float64
t_id           object
s_name         object
t_text         object
u_location     object
image_url      object
t_date         object
dtype: object

In [145]:
df.shape

(11210, 7)

In [146]:
df.isnull().sum()

Unnamed: 0     110
t_id             0
s_name           0
t_text           0
u_location    2188
image_url        0
t_date           0
dtype: int64

We have 11,210 tweets, and all but 2,188 of them have some sort of location set. The field we will be cleaning is u_location, so let's take a look at the unique values inside it.

In [147]:

df['u_location'].value_counts(ascending=False).head(20)

impostersyndrome.life    1677
u_location                110
London, England           106
London                     97
Los Angeles, CA            89
Chicago, IL                87
United States              71
Washington, DC             64
New York, NY               59
Atlanta, GA                58
Seattle, WA                52
California, USA            47
Houston, TX                42
UK                         41
United Kingdom             41
San Francisco, CA          41
Canada                     39
Toronto, Ontario           38
Boston, MA                 36
Brooklyn, NY               35
Name: u_location, dtype: int64

In [148]:
df['u_location'].value_counts(ascending=False).tail(20)

Old Trafford                      1
sometimes here sometimes there    1
Some sewer                        1
Manton, North Lincs               1
TO/tkaronto                       1
Nairobi, Kenya ✈️ dMv / 717       1
New York, New York                1
Cold                              1
Montrose, CO                      1
Sol III, Terran System            1
Wilpshire, England                1
Liminal space dweller             1
ʙᴏʟɪᴠɪᴀ; ʀᴏʟᴇᴘʟᴀʏᴇʀ; ᴍᴜʟᴛɪғᴀɴᴅ    1
Athens, Greece                    1
Paraguay                          1
Toulouse, France                  1
                                  1
ÜT: 40.647547,-73.928371          1
Boogie down                       1
Oceanside • ΓΦΒ☽                  1
Name: u_location, dtype: int64

## What are we looking for?
OK, some places are real and some are not. Looking at the data, I am going to stick with locations in the USA that were picked via the drop-down menu on Twitter. Here is what they look like:
#### City, State:
Maplewood, NJ
#### State, Country:
New Jersey, USA
#### Country:
United States

1. Entries will have a single comma or be "United States"
2. Full city name with a state abbreviation
3. Full state name with a country abbreviation
4. Full country name

#### What we want, is to convert it to one of the following:

Maplewood, New Jersey, USA

New Jersey, USA

USA
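
The three target formats above can be sketched as a small pure-Python function. This is only an illustration under stated assumptions: `STATE_NAMES` is a hypothetical three-entry subset of the state table, and `normalize_location` is a name introduced here, not something the notebook defines.

```python
# Hypothetical subset of the state table, for illustration only.
STATE_NAMES = {"NJ": "New Jersey", "NY": "New York", "CA": "California"}


def normalize_location(raw):
    """Return 'City, State, USA', 'State, USA', or 'USA'; None if unrecognized."""
    text = raw.strip()
    if text == "United States":
        return "USA"
    parts = [p.strip() for p in text.split(",")]
    if len(parts) != 2:
        return None  # rule 1: entries must contain exactly one comma
    left, right = parts
    if right in STATE_NAMES:
        # "City, ST" -> "City, State, USA"
        return f"{left}, {STATE_NAMES[right]}, USA"
    if right == "USA" and left in STATE_NAMES.values():
        # "State, USA" is already in the target form
        return f"{left}, USA"
    return None


for sample in ["Maplewood, NJ", "New Jersey, USA", "United States", "London, England"]:
    print(sample, "->", normalize_location(sample))
```

Anything that does not fit one of the three shapes comes back as `None`, which makes it easy to filter out later.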

## Cleaning
1. Remove empty location rows
2. Convert NJ to New Jersey, USA
3. Convert United States to USA
4. Keep only fields that have USA in them

In [149]:
# Flag retweets (text starting with 'RT') and drop them before cleaning locations
df['get_rt'] = df['t_text'].str.startswith('RT')
df.drop(df[df['get_rt'] == True].index, inplace=True)
df.shape

(2753, 8)

In [150]:
df.dropna(subset=['u_location'], how='all', inplace = True)

In [151]:
df.isnull().sum()

Unnamed: 0    110
t_id            0
s_name          0
t_text          0
u_location      0
image_url       0
t_date          0
get_rt          0
dtype: int64

In [152]:
df.shape

(2312, 8)

We started with a little over 11k tweets and, after dropping retweets and rows with no location, are down to 2,312. Now we will convert NJ to New Jersey, USA and United States to USA.

In [153]:
us_state_abbrev = {
    'AL': 'Alabama, USA',
    'AK': 'Alaska, USA',
    'AZ': 'Arizona, USA',
    'AR': 'Arkansas, USA',
    'CA': 'California, USA',
    'CO': 'Colorado, USA',
    'CT': 'Connecticut, USA',
    'DE': 'Delaware, USA',
    'FL': 'Florida, USA',
    'GA': 'Georgia, USA',
    'HI': 'Hawaii, USA',
    'ID': 'Idaho, USA',
    'IL': 'Illinois, USA',
    'IN': 'Indiana, USA',
    'IA': 'Iowa, USA',
    'KS': 'Kansas, USA',
    'KY': 'Kentucky, USA',
    'LA': 'Louisiana, USA',
    'ME': 'Maine, USA',
    'MD': 'Maryland, USA',
    'MA': 'Massachusetts, USA',
    'MI': 'Michigan, USA',
    'MN': 'Minnesota, USA',
    'MS': 'Mississippi, USA',
    'MO': 'Missouri, USA',
    'MT': 'Montana, USA',
    'NE': 'Nebraska, USA',
    'NV': 'Nevada, USA',
    'NH': 'New Hampshire, USA',
    'NJ': 'New Jersey, USA',
    'NM': 'New Mexico, USA',
    'NY': 'New York, USA',
    'NC': 'North Carolina, USA',
    'ND': 'North Dakota, USA',
    'OH': 'Ohio, USA',
    'OK': 'Oklahoma, USA',
    'OR': 'Oregon, USA',
    'PA': 'Pennsylvania, USA',
    'RI': 'Rhode Island, USA',
    'SC': 'South Carolina, USA',
    'SD': 'South Dakota, USA',
    'TN': 'Tennessee, USA',
    'TX': 'Texas, USA',
    'UT': 'Utah, USA',
    'VT': 'Vermont, USA',
    'VA': 'Virginia, USA',
    'WA': 'Washington, USA',
    'WV': 'West Virginia, USA',
    'WI': 'Wisconsin, USA',
    'WY': 'Wyoming, USA',
    'United States': 'USA'
}

In [154]:
# Convert two-letter state abbreviations to "State name, USA"
df['u_location'] = df['u_location'].replace(us_state_abbrev, regex=True)
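
With `regex=True`, each dictionary key is a pattern that can match anywhere in the string, so an abbreviation embedded in a longer token gets rewritten too. That is consistent with values such as `HTexas, USA` and `New York, USAC` in the output further down. One possible way to avoid this, shown as a sketch rather than what this notebook ran, is to wrap each key in `\b` word boundaries. The two-entry `abbrev` dictionary and the sample values are made up for illustration.

```python
import re

import pandas as pd

# Hypothetical two-entry subset of the abbreviation table.
abbrev = {"NY": "New York, USA", "TX": "Texas, USA"}

# Only match an abbreviation when it stands alone as a whole word.
bounded = {rf"\b{re.escape(key)}\b": value for key, value in abbrev.items()}

s = pd.Series(["Brooklyn, NY", "NYC", "HTX", "Austin, TX"])
cleaned = s.replace(bounded, regex=True)
print(cleaned.tolist())
```

Here `NYC` and `HTX` are left untouched, while the standalone `NY` and `TX` are expanded.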

In [155]:
usa_counts = df['u_location'].str.count('USA')
# usa_counts
df['with_usa'] = np.where((usa_counts == 1),df.u_location, "")

In [156]:
df['with_usa'].value_counts(ascending=False).tail(10)

Aubrey, Texas, USA                     1
Waco, Texas, USA                       1
Miami, Florida, USA                    1
West Hollywood, California, USA        1
Lewisburg, Pennsylvania, USA           1
South Pasadena, California, USA        1
Jacksonville, Florida, USA             1
Los Gatos, California, USA             1
Ohio University (Athens, Ohio, USA)    1
HTexas, USA                            1
Name: with_usa, dtype: int64

In [157]:
df['with_usa'].value_counts(ascending=False).head(20)

                                   1599
Seattle, Washington, USA             29
USA                                  26
New York, New York, USA              25
Los Angeles, California, USA         25
San Francisco, California, USA       19
Chicago, Illinois, USA               18
California, USA                      17
Atlanta, Georgia, USA                15
Portland, Oregon, USA                15
New York, USA                        15
Brooklyn, New York, USA              14
Boston, Massachusetts, USA           13
Austin, Texas, USA                   10
Houston, Texas, USA                   9
Philadelphia, Pennsylvania, USA       8
New York,New York, USA                8
Minneapolis, Minnesota, USA           8
San Diego, California, USA            7
New York, USAC                        7
Name: with_usa, dtype: int64

#### Evaluate
This is much better, but we are still getting some cruft. I am going to see if I can clean this up a bit more by blanking any entry that does not end in USA and stripping special characters from the rest.

In [158]:
#df[:50]

In [159]:
end_usa = df['u_location'].str.endswith('USA')
df['with_usa'] = np.where((end_usa == True),df.with_usa, "")
#df['with_usa'].value_counts(ascending=False).tail(120)

In [160]:
# No junk in the location field: keep only letters, whitespace and commas
df['with_usa'] = df['with_usa'].str.replace(r'[^a-zA-Z\s,]', '', regex=True)
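
The two filters above (must end in `USA`, then strip anything that is not a letter, whitespace or a comma) can be tried on a few made-up values. This is a minimal sketch, and the sample strings are assumptions chosen for illustration. `regex=True` is passed explicitly because newer pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

locations = pd.Series([
    "Seattle, Washington, USA",
    "Ohio University (Athens, Ohio, USA)",
    "St. Louis, Missouri, USA",
])

# Blank out anything that does not end in "USA".
kept = locations.where(locations.str.endswith("USA"), "")

# Strip every character that is not a letter, whitespace or a comma.
stripped = kept.str.replace(r"[^a-zA-Z\s,]", "", regex=True)
print(stripped.tolist())
```

Note the side effect on the last value: the period in `St.` is removed along with the rest of the punctuation.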


In [161]:
# make sure all fields are populated
df.dropna(inplace = True)

In [162]:
df['usa_count'] = df['with_usa'].str.count('USA')
df.drop(df[df['usa_count'] == 0].index, inplace=True)
df

Unnamed: 0.1,Unnamed: 0,t_id,s_name,t_text,u_location,image_url,t_date,get_rt,with_usa,usa_count
11,11.0,1052181339494129669,AngeloKnox6,"TIL about imposter syndrome, a psychological c...","Irvine, California, USA",http://pbs.twimg.com/profile_images/8910286096...,2018-10-16 12:55:54,False,"Irvine, California, USA",1
25,25.0,1052179193797193728,MADly_INsane,Sometimes the imposter syndrome jumps out and ...,"New York, New York, USA",http://pbs.twimg.com/profile_images/1033417894...,2018-10-16 12:47:23,False,"New York, New York, USA",1
47,47.0,1052176516069642241,ksmithedu,"Imposter Syndrome, and strategies for overcomi...","Raleigh, North Carolina, USA",http://pbs.twimg.com/profile_images/5242627518...,2018-10-16 12:36:44,False,"Raleigh, North Carolina, USA",1
71,71.0,1052172522626658304,LauraAnnRussell,“How to Overcome Impostor Syndrome” because im...,"Opelika, Alabama, USA",http://pbs.twimg.com/profile_images/9616826618...,2018-10-16 12:20:52,False,"Opelika, Alabama, USA",1
121,20.0,1052162066033889281,wolfyseyes,It's extremely hard not to think this confirms...,"CNew York, USA",http://pbs.twimg.com/profile_images/1022792215...,2018-10-16 11:39:19,False,"CNew York, USA",1
224,22.0,1052130637979635712,icecry,Imposter syndrome has me tripping up right bef...,"Los Angeles, California, USA",http://pbs.twimg.com/profile_images/1046163324...,2018-10-16 09:34:26,False,"Los Angeles, California, USA",1
358,55.0,1052084745251651584,saint_luxe,I finally overcame my imposter syndrome and jo...,"Los Angeles, California, USA",http://pbs.twimg.com/profile_images/9481036256...,2018-10-16 06:32:04,False,"Los Angeles, California, USA",1
434,30.0,1052061362229104643,JMSchuurmans,For all of us @FlatironSchool and everywhere w...,"Portland, Oregon, USA",http://pbs.twimg.com/profile_images/1025102656...,2018-10-16 04:59:09,False,"Portland, Oregon, USA",1
454,50.0,1052053131956289536,kvslice,"Oh hey, just sitting here wide awake after mid...","Durham, North Carolina, USA",http://pbs.twimg.com/profile_images/9918378225...,2018-10-16 04:26:27,False,"Durham, North Carolina, USA",1
501,97.0,1052044302963171328,JMichaelMahr,(Apparently wanting to be effective and effici...,"Davenport, Iowa, USA",http://pbs.twimg.com/profile_images/1035578458...,2018-10-16 03:51:22,False,"Davenport, Iowa, USA",1


In [163]:
# we can only have up to two commas, drop rows with more
comma_counts = df['with_usa'].str.count(',')
df['with_comma'] = np.where((comma_counts <= 2),df.with_usa, "")



#### Split into three different columns

In [164]:
df[['Country', 'State', 'City']] = df['with_comma'].apply(lambda x:pd.Series(x.split(",")[::-1]))

In [170]:
df['Country'] = df['Country'].str.strip()
df['State'] = df['State'].str.strip()
df['City'] = df['City'].str.strip()
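
The reverse-and-split step can also be written without building a `pd.Series` per row. This is a minimal sketch, assuming every value has between one and three comma-separated parts. The sample values and the `out` name are illustrative only.

```python
import pandas as pd

locations = pd.Series(["Irvine, California, USA", "New York, USA", "USA"])

# Split on commas, trim whitespace, and reverse so the country comes first.
parts = locations.str.split(",").apply(lambda p: [x.strip() for x in p[::-1]])

# Pad shorter entries so every row has exactly three values.
padded = parts.apply(lambda p: p + [None] * (3 - len(p)))

out = pd.DataFrame(
    padded.tolist(),
    columns=["Country", "State", "City"],
    index=locations.index,
)
print(out)
```

Reversing before padding means a state-only entry fills `Country` and `State` and leaves `City` empty, which matches the layout in the table below.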


In [171]:
# Drop the helper columns; assign the result back so the change is kept
df = df.drop(['with_usa', 'usa_count', 'with_comma', 'get_rt'], axis=1)
df


Unnamed: 0.1,Unnamed: 0,t_id,s_name,t_text,u_location,image_url,t_date,Country,State,City
11,11.0,1052181339494129669,AngeloKnox6,"TIL about imposter syndrome, a psychological c...","Irvine, California, USA",http://pbs.twimg.com/profile_images/8910286096...,2018-10-16 12:55:54,USA,California,Irvine
25,25.0,1052179193797193728,MADly_INsane,Sometimes the imposter syndrome jumps out and ...,"New York, New York, USA",http://pbs.twimg.com/profile_images/1033417894...,2018-10-16 12:47:23,USA,New York,New York
47,47.0,1052176516069642241,ksmithedu,"Imposter Syndrome, and strategies for overcomi...","Raleigh, North Carolina, USA",http://pbs.twimg.com/profile_images/5242627518...,2018-10-16 12:36:44,USA,North Carolina,Raleigh
71,71.0,1052172522626658304,LauraAnnRussell,“How to Overcome Impostor Syndrome” because im...,"Opelika, Alabama, USA",http://pbs.twimg.com/profile_images/9616826618...,2018-10-16 12:20:52,USA,Alabama,Opelika
121,20.0,1052162066033889281,wolfyseyes,It's extremely hard not to think this confirms...,"CNew York, USA",http://pbs.twimg.com/profile_images/1022792215...,2018-10-16 11:39:19,USA,CNew York,
224,22.0,1052130637979635712,icecry,Imposter syndrome has me tripping up right bef...,"Los Angeles, California, USA",http://pbs.twimg.com/profile_images/1046163324...,2018-10-16 09:34:26,USA,California,Los Angeles
358,55.0,1052084745251651584,saint_luxe,I finally overcame my imposter syndrome and jo...,"Los Angeles, California, USA",http://pbs.twimg.com/profile_images/9481036256...,2018-10-16 06:32:04,USA,California,Los Angeles
434,30.0,1052061362229104643,JMSchuurmans,For all of us @FlatironSchool and everywhere w...,"Portland, Oregon, USA",http://pbs.twimg.com/profile_images/1025102656...,2018-10-16 04:59:09,USA,Oregon,Portland
454,50.0,1052053131956289536,kvslice,"Oh hey, just sitting here wide awake after mid...","Durham, North Carolina, USA",http://pbs.twimg.com/profile_images/9918378225...,2018-10-16 04:26:27,USA,North Carolina,Durham
501,97.0,1052044302963171328,JMichaelMahr,(Apparently wanting to be effective and effici...,"Davenport, Iowa, USA",http://pbs.twimg.com/profile_images/1035578458...,2018-10-16 03:51:22,USA,Iowa,Davenport


In [167]:
df.to_csv('tweets_clean_new.csv')