# Exploratory Data Analysis

## Importing data into pandas

In [2]:
import pandas as pd
import numpy as np


data = pd.read_csv('./nlp-getting-started/train.csv')

data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
data.describe(include='all')

Unnamed: 0,id,keyword,location,text,target
count,7613.0,7552,5080,7613,7613.0
unique,,221,3341,7503,
top,,fatalities,USA,11-Year-Old Boy Charged With Manslaughter of T...,
freq,,45,104,10,
mean,5441.934848,,,,0.42966
std,3137.11609,,,,0.49506
min,1.0,,,,0.0
25%,2734.0,,,,0.0
50%,5408.0,,,,0.0
75%,8146.0,,,,1.0


In [7]:
pd.set_option('display.max_rows', len(data))
data

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
5,8,,,#RockyFire Update => California Hwy. 20 closed...,1
6,10,,,#flood #disaster Heavy rain causes flash flood...,1
7,13,,,I'm on top of the hill and I can see a fire in...,1
8,14,,,There's an emergency evacuation happening now ...,1
9,15,,,I'm afraid that the tornado is coming to our a...,1


### Analysis
- there are 7613 data points
- **99.20%** of the rows have a **keyword** (7552 of 7613)
- **66.73%** of the rows have a **location** (5080 of 7613)
- the most frequent keyword used to extract tweets is **fatalities** (45 occurrences)
- the data is ordered by the keyword used to extract the tweet from Twitter, so it should be shuffled before use
- some of the text contains the `#` symbol, which causes an error when the data is exported to a NumPy array
- entries in the **text** column that are not wrapped in quotation marks should not contain a **,**
- entries in the **location** column may also contain **,**, which **np** will read as a column delimiter
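
The coverage figures and the shuffle mentioned in the analysis can be reproduced with a few pandas calls. This is a minimal sketch on a small stand-in frame: `toy` is hypothetical (not the competition data) and the seed `42` is an arbitrary choice. One plausible cause of the `#` error, which is an assumption since the export code is not shown, is a text parser such as `np.genfromtxt`, which treats `#` as a comment marker by default.

```python
import pandas as pd

# Hypothetical stand-in for train.csv, using the same column names.
toy = pd.DataFrame({
    "id": [1, 4, 5, 6],
    "keyword": [None, "ablaze", "ablaze", "flood"],
    "location": [None, "London, UK", None, "USA"],
    "text": ["#earthquake today", "Set me ablaze", "fire, smoke", "#flood warning"],
    "target": [1, 0, 1, 1],
})

# Percentage of rows with a non-null value, per column.
coverage = toy.notna().mean() * 100

# Shuffle the rows so they are no longer grouped by keyword.
shuffled = toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Converting in memory involves no text parsing, so '#' and ','
# inside the strings are left untouched.
as_array = shuffled[["text", "target"]].to_numpy()
```

Converting with `to_numpy()` sidesteps delimiter and comment handling entirely, which is why it is used here instead of writing the frame to text and reading it back.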

### Decisions 

- the most important columns are the **text** and **target** columns
- the text column contains the content of the tweet
- the keyword column can be discarded because the keyword appears within the tweet itself
- the location column can be discarded because only 66.73% of the rows have a location value, and dropping the remaining 33% of the data is impractical
- it is, however, worth exploring whether the location of a tweet has an impact on its real or fake status
- in some locations, such as a city centre, a veld fire cannot occur, so that is a consideration to be made
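
One way to check whether the presence of a location relates to the label, before discarding the column, is to compare the share of `target == 1` tweets with and without a location. The sketch below uses a hypothetical `toy` frame with made-up values; only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical rows; the values are invented for illustration only.
toy = pd.DataFrame({
    "keyword": ["ablaze", "ablaze", "flood", "flood", "crash", "crash"],
    "location": ["Birmingham", None, "USA", None, "London, UK", "USA"],
    "text": ["a", "b", "c", "d", "e", "f"],
    "target": [1, 0, 1, 0, 1, 0],
})

# Flag whether a tweet carries any location at all.
has_location = toy["location"].notna().rename("has_location")

# Share of real-disaster tweets (target == 1) in each group.
rate_by_presence = toy.groupby(has_location)["target"].mean()

# Keep only the two columns the decisions above settle on.
model_input = toy[["text", "target"]]
```

On the real data, similar rates in both groups would support dropping the column, while a clear gap would suggest keeping at least a has-location flag.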

### Exploring the location column

In [5]:
data_for_exploring = data.copy()
data_for_exploring.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [6]:
data_for_exploring['location'].isnull()

0       True
1       True
2       True
3       True
4       True
        ... 
7608    True
7609    True
7610    True
7611    True
7612    True
Name: location, Length: 7613, dtype: bool
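
The truncated Boolean output shows only the first and last five rows, so it says little about how many locations are missing overall. Summing the mask gives a count directly. This is a small sketch on a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical stand-in for the location column.
toy = pd.DataFrame({"location": [None, "USA", None, "London, UK"]})

missing_mask = toy["location"].isnull()   # same call as in the cell above
n_missing = int(missing_mask.sum())       # each True counts as 1
share_missing = missing_mask.mean()       # fraction of rows without a location
```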

In [7]:
data_after_null_removal = data_for_exploring.copy()
data_after_null_removal = data_after_null_removal.dropna(subset=['location'])
data_after_null_removal.describe(include='all')

Unnamed: 0,id,keyword,location,text,target
count,5080.0,5080,5080,5080,5080.0
unique,,221,3341,5028,
top,,collision,USA,#Bestnaijamade: 16yr old PKK suicide bomber wh...,
freq,,36,104,6,
mean,5407.112598,,,,0.432283
std,3116.359041,,,,0.495442
min,48.0,,,,0.0
25%,2728.75,,,,0.0
50%,5360.5,,,,0.0
75%,8086.0,,,,1.0


In [8]:
data_after_null_removal.head()

Unnamed: 0,id,keyword,location,text,target
31,48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
32,49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0
33,50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
34,52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0
35,53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0


In [10]:
data_after_null_removal['location'].unique()

array(['Birmingham', 'Est. September 2012 - Bristol', 'AFRICA', ...,
       'Vancouver, Canada', 'London ', 'Lincoln'], dtype=object)
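
The unique values include entries such as `'London '` with a trailing space, so some locations are probably counted more than once. A light normalization pass, sketched here on hypothetical values, collapses the trivially different spellings. It does not merge genuinely different strings such as `'London'` and `'London, UK'`.

```python
import pandas as pd

# Hypothetical location strings with whitespace and case differences.
locations = pd.Series(["London ", "london", "London, UK", "USA", " usa"])

# Trim surrounding whitespace and fold case.
normalized = locations.str.strip().str.casefold()

n_before = locations.nunique()
n_after = normalized.nunique()
```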

In [14]:
# Sum the target labels (1 = real disaster) for each location.
mydict = {}

# Column order in the array: id, keyword, location, text, target
nu = np.array(data_after_null_removal)

for row in nu:
    location, target = row[2], row[4]
    mydict[location] = mydict.get(location, 0) + target

# Placeholder for a per-location summary table; not populated yet.
sumTable = pd.DataFrame(columns=['location','sum'])

mydict

{'Birmingham': 3,
 'Est. September 2012 - Bristol': 0,
 'AFRICA': 1,
 'Philadelphia, PA': 2,
 'London, UK': 5,
 'Pretoria': 0,
 'World Wide!!': 1,
 'Paranaque City': 0,
 'Live On Webcam': 0,
 'milky way': 0,
 'GREENSBORO,NORTH CAROLINA': 1,
 'England.': 0,
 'Sheffield Township, Ohio': 1,
 'India': 20,
 'Barbados': 1,
 'Anaheim': 1,
 'Abuja': 0,
 'USA': 67,
 'South Africa': 2,
 'Sao Paulo, Brazil': 0,
 'hollywoodland ': 0,
 'Edmonton, Alberta - Treaty 6': 1,
 'Inang Pamantasan': 0,
 'Twitter Lockout in progress': 0,
 'Concord, CA': 1,
 'Calgary, AB': 3,
 'San Francisco': 6,
 'CLVLND': 0,
 'Nashville, TN': 7,
 'Santa Clara, CA': 1,
 'UK': 16,
 'St. Louis, MO': 1,
 'Walker County, Alabama': 1,
 'Australia': 9,
 'North Carolina': 3,
 'Norf Carolina': 0,
 'San Mateo County, CA': 1,
 'Njoro, Kenya': 1,
 "Your Sister's Bedroom": 1,
 'Arlington, TX': 1,
 'South Bloomfield, OH': 1,
 'New Hanover County, NC': 2,
 'Maldives': 1,
 'Manchester, NH': 2,
 'Wilmington, NC': 1,
 'global': 1,
 'Alberta 

In [13]:
sumTable

Unnamed: 0,location,sum
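
The summary table is still empty at this point. A per-location table of the kind `sumTable` seems intended to hold can be built directly with `groupby`. This is a sketch on a hypothetical `toy` frame: the `count` column is an addition, included so that a sum can be read against the number of tweets from that location.

```python
import pandas as pd

# Hypothetical rows with the same two columns used in the loop above.
toy = pd.DataFrame({
    "location": ["USA", "USA", "London, UK", "India", "London, UK"],
    "target": [1, 1, 0, 1, 1],
})

# Sum and count of the target label per location, largest sum first.
sum_table = (
    toy.groupby("location")["target"]
    .agg(["sum", "count"])
    .reset_index()
    .sort_values("sum", ascending=False, ignore_index=True)
)
```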


In [98]:
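# Scratch check: a dict can be stored as a single DataFrame cell.
# A separate name is used so the tweet data in `data` is not overwritten.
d = {'a': 1, 'b': 2}
scratch = pd.DataFrame()
scratch['new column'] = [d]
scratch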

Unnamed: 0,new column
0,"{'a': 1, 'b': 2}"


## Data Processing

### Feature extraction techniques

1. One-hot encoding
2. One-hot encoding with TF-IDF weighting
3. Word embeddings (Word2Vec, GloVe, Keras embedding layer)
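
The first two techniques can be illustrated without any library. The sketch below assumes whitespace tokenization and the textbook TF-IDF formula, tf × log(N / df); library implementations usually add smoothing and normalization, so their numbers will differ. The three example sentences are made up. Word embeddings need a trained model and are not sketched here.

```python
import math
from collections import Counter

# Hypothetical tweets, lowercased and split on whitespace.
docs = [
    "forest fire near la ronge",
    "fire evacuation ordered",
    "i love this song",
]
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

# 1. One-hot (binary bag-of-words): 1 if the word occurs in the tweet.
one_hot = [[1 if word in doc else 0 for word in vocab] for doc in tokenized]

# 2. TF-IDF: term frequency scaled by log(N / document frequency),
#    so words that appear in many tweets get a lower weight.
n_docs = len(tokenized)
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

def tfidf(doc):
    counts = Counter(doc)
    return [
        (counts[word] / len(doc)) * math.log(n_docs / doc_freq[word])
        for word in vocab
    ]

tfidf_matrix = [tfidf(doc) for doc in tokenized]
```

In the first tweet, `forest` receives a higher weight than `fire` because `fire` also appears in the second tweet.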


## Model

In [None]:
- Initialize the model parameters in some manner.
- Using the input data and current model parameters, figure out the loss value of the current network weights and biases.
- Figure out how to update the weights and biases such that the loss value decreases.
- Update the weights and biases a certain amount based on the current learning rate.
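
The four steps above can be sketched as a plain gradient-descent loop. This example is an assumption-laden stand-in, not the notebook's model: it uses a single-layer logistic model instead of a full network, a hypothetical four-sample dataset, binary cross-entropy as the loss, and an arbitrary learning rate of 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: the label equals the first feature.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

# Step 1: initialize the parameters (small random weights, zero bias).
w = rng.normal(scale=0.01, size=X.shape[1])
b = 0.0
learning_rate = 0.5

def forward_loss(w, b):
    """Return predicted probabilities and the binary cross-entropy loss."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return p, loss

losses = []
for _ in range(200):
    # Step 2: compute the loss for the current weights and bias.
    p, loss = forward_loss(w, b)
    losses.append(loss)
    # Step 3: gradients of the loss with respect to w and b.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # Step 4: move against the gradient, scaled by the learning rate.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b
```

Tracking `losses` makes it easy to confirm that the loss actually decreases from one update to the next.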
