## Semester Project - Wikipedia-based geolocalization

In this Jupyter notebook we implement a solution that localizes tweets using the words with the highest tf-idf scores from each state's Wikipedia page.

* Fetch the Wikipedia page of every state we are interested in
* Perform a tf-idf analysis on the resulting set of pages
* Extract the top 10, 25, and 50 words for each state
* Score each tweet against those word lists
* Classify the tweet's location using that score
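
The last two steps can be sketched as follows. This is only a minimal illustration, not the notebook's final scoring scheme: the `top_words` dictionary holds a few made-up words per state, and `tokenize`, `score_tweet`, and `classify_tweet` are hypothetical helper names. Each state's score is simply the number of tweet tokens found in its top-word set.

```python
import re

# Hypothetical top-word sets; in the notebook these would come from the tf-idf analysis.
top_words = {
    "New York": {"manhattan", "brooklyn", "hudson", "albany"},
    "New Jersey": {"newark", "bergen", "hudson", "trenton"},
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def score_tweet(tweet, top_words):
    """Count, for each state, how many tweet tokens appear in its top-word set."""
    tokens = tokenize(tweet)
    return {state: sum(tok in words for tok in tokens)
            for state, words in top_words.items()}

def classify_tweet(tweet, top_words):
    """Return the best-scoring state, or None when no top word matches."""
    scores = score_tweet(tweet, top_words)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

tweet = "Stuck in traffic near Newark, Bergen county again"
scores = score_tweet(tweet, top_words)
prediction = classify_tweet(tweet, top_words)
print(scores, prediction)
```

A plain match count ignores how distinctive each word is; weighting each match by its tf-idf value would be a natural refinement.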

### Imports

In [50]:
import wikipedia
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

### Localizer class
This class stores all the states in a list and, after fetching the pages from Wikipedia, also holds the text corresponding to each state. It also provides a method that vectorizes the texts for the tf-idf analysis.

#### Attributes:
* self.locations [] : All the states we are interested in
* self.texts [] : Text of wikipedia page corresponding to location

#### Functions
* add_SingleLocation(self, location) : adds a single location to the list
* add_listLocation(self, locationList) : adds a list of locations to the list
* get_WikiText(self) : fetches the Wikipedia text of every location and appends it to self.texts
* printText(self) : prints every text in self.texts
* vectorizer(self) : returns the document-term matrix of the texts and the corresponding features (i.e. words); as written, the matrix holds raw term counts from CountVectorizer, not tf-idf weights
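
`CountVectorizer` on its own produces raw term counts. A minimal sketch of turning such counts into tf-idf weights with `TfidfTransformer` is shown here on three made-up one-line documents; the documents and variable names are illustrative assumptions, not the Wikipedia texts.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy documents standing in for the state pages (illustrative only).
docs = [
    "new york state city manhattan hudson",
    "new jersey state county newark hudson",
    "texas state austin houston ranch",
]

count_vectorizer = CountVectorizer(stop_words="english")
counts = count_vectorizer.fit_transform(docs)      # sparse matrix of raw counts
tfidf = TfidfTransformer().fit_transform(counts)   # idf-weighted, L2-normalized rows

vocab = count_vectorizer.vocabulary_               # maps each word to its column index
for word in ("state", "manhattan"):
    col = vocab[word]
    print(word, counts[0, col], round(float(tfidf[0, col]), 3))
```

Both words occur once in the first document, but "state" appears in every document and therefore receives a lower weight than "manhattan".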

In [40]:
class Localizer:
    
    def __init__(self):
        self.locations = []
        self.texts = []
        
    def add_SingleLocation(self, location):
        self.locations.append(location)

    def add_listLocation(self, locationList):
        # extend() copies the items, so the caller's list is never aliased
        self.locations.extend(locationList)

    def get_WikiText(self):
        for l in self.locations:
            p = wikipedia.page(str(l))
            self.texts.append(p.content)

    def printText(self):
        for t in self.texts:
            print(t)
            
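    def vectorizer(self):
        # CountVectorizer yields raw term counts; no idf weighting is applied here
        vectorizer = CountVectorizer(stop_words='english')
        X = vectorizer.fit_transform(self.texts)
        # get_feature_names() was removed in scikit-learn 1.2
        features = vectorizer.get_feature_names_out().tolist()
        return X, features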

In [41]:
test = ["New York State", "New Jersey State"]

L = Localizer()
L.add_listLocation(test)
L.get_WikiText()

In [15]:
L.printText()

New York is a state in the northeastern United States. New York was one of the original thirteen colonies that formed the United States. With an estimated 19.85 million residents in 2017, it is the fourth most populous state in the United States. To differentiate from its city with the same name, it is sometimes called New York State.
The state's largest city, New York City, makes up over 40% of the state's population. Two-thirds of the state's population lives in the New York metropolitan area, and nearly 40% lives on Long Island. The state and city were both named for the 17th-century Duke of York, future King James II of England. With an estimated population of 8.55 million in 2015, New York City is the most populous city in the United States and the premier gateway for legal immigration to the United States. The New York Metropolitan Area is one of the most populous in the world. New York City is a global city, home to the United Nations Headquarters and has been described as the c

In [42]:
testx, testy = L.vectorizer()

In [17]:
print(testx)

  (0, 2990)	1
  (0, 1477)	1
  (0, 1319)	1
  (0, 2526)	1
  (0, 2549)	1
  (0, 1738)	1
  (0, 321)	1
  (0, 3)	1
  (0, 358)	1
  (0, 460)	1
  (0, 2338)	1
  (0, 3267)	1
  (0, 3050)	1
  (0, 2045)	2
  (0, 851)	2
  (0, 1998)	2
  (0, 4105)	2
  (0, 219)	1
  (0, 2948)	2
  (0, 3829)	1
  (0, 3119)	1
  (0, 1956)	1
  (0, 110)	1
  (0, 2153)	1
  (0, 3407)	1
  :	:
  (1, 204)	4
  (1, 3522)	24
  (1, 2747)	9
  (1, 425)	1
  (1, 121)	2
  (1, 1664)	6
  (1, 605)	33
  (1, 4455)	83
  (1, 1872)	4
  (1, 4108)	45
  (1, 1100)	3
  (1, 4127)	1
  (1, 3012)	1
  (1, 2951)	434
  (1, 2977)	39
  (1, 4388)	98
  (1, 3919)	43
  (1, 4278)	27
  (1, 2912)	4
  (1, 4109)	921
  (1, 2228)	428
  (1, 3915)	151
  (1, 2337)	121
  (1, 4491)	52
  (1, 2884)	330


In [18]:
print(testy)

['000', '013', '016', '031', '038', '040', '052', '057', '06', '073', '07652', '079', '093', '10', '100', '10021', '102', '108', '109', '11', '110', '113', '114', '116', '119', '11th', '12', '120', '121', '128', '12th', '13', '135', '138', '139', '13th', '14', '141', '142', '146', '147', '14th', '15', '150', '1524', '1540', '16', '160', '1600s', '1609', '1614', '1617', '1623', '1624', '1625', '163', '1630', '1640', '1647', '1649', '165', '1653', '166', '1664', '1672', '1673', '1674', '1680', '169', '1692', '16th', '17', '1700s', '1702', '1708', '172', '1738', '1760s', '1765', '1775', '1776', '1777', '1778', '1779', '1783', '1786', '1787', '1788', '1789', '1790', '1797', '17th', '18', '180', '1800s', '1804', '1807', '1812', '1817', '182', '1825', '1827', '1831', '1834', '1844', '1847', '185', '1850', '1855', '1857', '1860', '1862', '1864', '1869', '1878', '1885', '1886', '1890', '1892', '1894', '18th', '19', '1900', '1904', '1907', '1909', '1911', '1917', '1918', '1921', '1923', '1924',

In [43]:
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)


In [23]:
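# top_tfidf_feats expects a dense 1-D row; passing the sparse row directly gives the meaningless result below
print(top_tfidf_feats(testx[0], testy))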

  feature                                              tfidf
0     000    (0, 2990)\t1\n  (0, 1477)\t1\n  (0, 1319)\t1...


In [44]:
print(top_feats_in_doc(testx, testy, 1))

       feature  tfidf
0          new    330
1       jersey    306
2        state    151
3       county    116
4         city     74
5         york     52
6       states     43
7   population     40
8       bergen     34
9     governor     33
10    delaware     32
11      newark     29
12    counties     29
13       river     29
14      united     27
15         tax     25
16     service     25
17   residents     24
18     largest     24
19         bus     24
20        2010     23
21      hudson     22
22    atlantic     22
23    american     22
24    national     21


In [45]:
print(top_feats_in_doc(testx, testy, 0))

       feature  tfidf
0          new    352
1         york    311
2        state    157
3         city    108
4       states     61
5       united     55
6       island     52
7     national     46
8      largest     42
9   population     40
10        long     33
11      county     31
12   manhattan     30
13        area     27
14      hudson     26
15  university     26
16    american     26
17       river     26
18        lake     23
19        park     21
20  government     21
21   including     21
22       major     20
23       world     19
24        home     19
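
Many of the highest-count words, such as "new", "state", and "city", appear in both lists, so raw counts alone separate the two states poorly. As a rough illustration, the sketch below copies the two top-25 lists printed above and keeps only the words unique to each state. Set difference is used here as a crude stand-in for idf weighting, not as the project's method.

```python
# Top-25 words copied from the two outputs above.
new_jersey = {
    "new", "jersey", "state", "county", "city", "york", "states",
    "population", "bergen", "governor", "delaware", "newark", "counties",
    "river", "united", "tax", "service", "residents", "largest", "bus",
    "2010", "hudson", "atlantic", "american", "national",
}
new_york = {
    "new", "york", "state", "city", "states", "united", "island",
    "national", "largest", "population", "long", "county", "manhattan",
    "area", "hudson", "university", "american", "river", "lake", "park",
    "government", "including", "major", "world", "home",
}

shared = new_jersey & new_york    # words that cannot tell the two states apart
nj_only = new_jersey - new_york   # candidates for New Jersey
ny_only = new_york - new_jersey   # candidates for New York

print(len(shared), sorted(nj_only), sorted(ny_only))
```

With only two documents this is easy to do by hand; with all states in the corpus, idf weighting achieves the same effect in a graded way.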


In [54]:
def top_mean_feats(Xtr, features, grp_ids=None, min_tfidf=0.1, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
        identified by indices in grp_ids. '''
    # "is not None" also works when grp_ids is a NumPy array, whose truth value is ambiguous
    if grp_ids is not None:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()

    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)

def top_feats_by_class(Xtr, y, features, min_tfidf=0.1, top_n=25):
    ''' Return a list of dfs, where each df holds top_n features and their mean tfidf value
        calculated across documents with the same class label. '''
    dfs = []
    y = np.asarray(y)
    labels = np.unique(y)
    for label in labels:
        # np.where returns a tuple; its first element holds the row indices
        ids = np.where(y == label)[0]
        feats_df = top_mean_feats(Xtr, features, ids, min_tfidf=min_tfidf, top_n=top_n)
        feats_df.label = label
        dfs.append(feats_df)
    return dfs

def plot_tfidf_classfeats_h(dfs):
    ''' Plot the data frames returned by the function top_feats_by_class(). '''
    fig = plt.figure(figsize=(12, 9), facecolor="w")
    x = np.arange(len(dfs[0]))
    for i, df in enumerate(dfs):
        ax = fig.add_subplot(1, len(dfs), i+1)
        ax.spines["top"].set_visible(False)
        ax.spines["right"].set_visible(False)
        ax.set_frame_on(False)
        ax.get_xaxis().tick_bottom()
        ax.get_yaxis().tick_left()
        ax.set_xlabel("Mean Tf-Idf Score", labelpad=16, fontsize=14)
        ax.set_title("label = " + str(df.label), fontsize=16)
        ax.ticklabel_format(axis='x', style='sci', scilimits=(-2,2))
        ax.barh(x, df.tfidf, align='center', color='#3F5D7D')
        ax.set_yticks(x)
        ax.set_ylim([-1, x[-1]+1])
        ax.set_yticklabels(df.feature)
    # adjust the layout once, after all subplots exist
    plt.subplots_adjust(bottom=0.09, right=0.97, left=0.15, top=0.95, wspace=0.52)
    plt.show()

In [51]:
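df0 = top_feats_in_doc(testx, testy, 0)
df1 = top_feats_in_doc(testx, testy, 1)

# plot_tfidf_classfeats_h reads a `label` attribute from each frame for the subplot title;
# top_feats_in_doc does not set one, which raised
# "AttributeError: 'DataFrame' object has no attribute 'label'" in the original run
df0.label = test[0]
df1.label = test[1]

plot_tfidf_classfeats_h([df0, df1])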

In [64]:
bigger_test = ['Alabama',
            'Alaska state',
            'Arizona state',
            'Arkansas state',
            'California state',
            'Colorado state',
            'Connecticut state',
            'Delaware state',
            'Florida state',
            'Georgia state',
            'Hawaii state',
            'Idaho state',
            'Illinois state',
            'Indiana state',
            'Iowa state',
            'Kansas state',
            'Kentucky state',
            'Louisiana state',
            'Maine state',
            'Maryland state',
            'Massachusetts state',
            'Michigan state',
            'Minnesota state',
            'Mississippi state',
            'Missouri state',
            'Montana state',
            'Nebraska state',
            'Nevada state',
            'New Hampshire state',
            'New Jersey state',
            'New Mexico state',
            'New York state',
            'North Carolina state',
            'North Dakota state',
            'Ohio state',
            'Oklahoma state',
            'Oregon state',
            'Pennsylvania state',
            'Rhode Island state',
            'South Carolina state',
            'South Dakota state',
            'Tennessee state',
            'Texas state',
            'Utah state',
            'Vermont state',
            'Virginia state',
            'Washington state',
            'West Virginia state',
            'Wisconsin state',
            'Wyoming state',
            'Ontario',
            'Quebec',
            'Nova Scotia',
            'New Brunswick',
            'Manitoba',
            'British Columbia',
            'Prince Edward state',
            'Saskatchewan state',
            'Alberta state',
            'Newfoundland and Labrador state',
            'Washington, D.C. state',
            'Chihuahua state',
            'Baja California state',
            'Freeport bahamas',
            'Nuevo Leon',
              ]

In [65]:
L = Localizer()
L.add_listLocation(bigger_test)
L.get_WikiText()

In [66]:
testx, testy = L.vectorizer()

In [69]:
print(top_feats_in_doc(testx, testy, 3))

            feature  tfidf
0          arkansas     40
1             state     33
2               asu     25
3            player     22
4        university     18
5               nfl     18
6            campus     14
7            degree     13
8            county     12
9             alpha     11
10           member     11
11  representatives     10
12            house     10
13          college     10
14         programs      8
15             year      8
16         district      8
17             fall      7
18            beebe      7
19             home      7
20          program      7
21           school      6
22       republican      6
23             2013      6
24        jonesboro      6
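
Several words in this list ("asu", "nfl", "campus", "jonesboro") suggest that the query 'Arkansas state' may have resolved to a university article rather than to the state. One possible safeguard is to compare the title of each fetched page with the expected state name. The sketch below is an illustration under stated assumptions: `title_matches` and `find_mismatches` are hypothetical helpers, and the fetched titles are example inputs rather than recorded results. In the notebook they could be collected from the `title` attribute of the objects returned by `wikipedia.page`.

```python
def title_matches(expected, fetched_title):
    """Accept an exact match or a parenthesised form such as 'New York (state)'."""
    return fetched_title == expected or fetched_title.startswith(expected + " (")

def find_mismatches(fetched):
    """Return the (expected, fetched_title) pairs that do not look like the intended page."""
    return [(expected, title) for expected, title in fetched.items()
            if not title_matches(expected, title)]

# Example inputs only: expected state name -> title of the page that was fetched.
fetched = {
    "Arkansas": "Arkansas State University",
    "New York": "New York (state)",
    "Alabama": "Alabama",
}

mismatches = find_mismatches(fetched)
print(mismatches)
```

Any flagged entry can then be re-queried with a more specific title before the texts are vectorized.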


In [79]:
import pickle

# the context manager closes the file even if pickling fails
with open('data.pkl', 'wb') as f:
    pickle.dump(L, f)

In [None]:
# pickle.load needs an open binary file object, not a file name
with open('data.pkl', 'rb') as f:
    L = pickle.load(f)