<img src="images/generalassembly-open-graph.png" width="240" height="240" align="left"/>

# Data architecture notebook
**Author: Rodolfo Flores Mendez**
<br> May 2019 | Chicago, IL.

### Table of contents
- [Overview](#ov)
- [Importing libraries](#imp)
- [Merging reviews and business data](#me)
- [Construction of distance matrix](#dist)
- [Construction of category matrix](#cat)
- [Script for data architecture as numpy array](#da)

### Overview<a id="ov"></a>
This notebook presents in detail the script used to create the proposed data architecture for businesses in the "area of interest" of "Las Vegas Strip", as explained in the Readme. 

There are four key steps to compile the desired data architecture:

   - **(1)** Combine the features from multiple dataframes.
   - **(2)** Construct a distance matrix to mask businesses by a distance criterion (radius of influence).
   - **(3)** Construct a category matrix to mask businesses by a business category criterion.
   - **(4)** Use the distance and category matrices to loop through specific distance and category bins and build the 4D tensor for the CNN model.
    
The loop in step (4) that builds the data architecture for the CNN model was written with numpy, because the multidimensional data was too large to manage with pandas. 

### Importing libraries<a id="imp"></a>
In this section we outline the initial code needed to run this notebook. If this code returns an error, we recommend verifying that up-to-date versions of the libraries listed below are installed. For guidance on installing Python modules, please refer to the __[official documentation](https://docs.python.org/3/installing/)__.

In [155]:
#Importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import random

#Time library
import time

#Setting max rows and columns
pd.set_option('display.max_columns', 10000)
pd.set_option('display.max_rows', 10000)

#Import library for calculating distance between lat and long
import utm

#Import combination and permutation library
from itertools import combinations

# Import Gensim for Word2Vec similarity
import gensim
from gensim.models.word2vec import Word2Vec

#Import Standard Scaler
from sklearn.preprocessing import StandardScaler

### Merging reviews and business data <a id="me"></a>
In this section we merge the dataframes that were generated on the preprocessing and data extraction notebooks by business id for the "area of interest" of "Las Vegas Strip".

In [83]:
#Read the data from the CSV folder, business dataframe created on the data extraction process
df_business = pd.read_csv('./csv_data/business.csv').drop(columns = 'Unnamed: 0')

#Select relevant columns to build the data architecture, these are lat and long for distance metrics (Y dimension),
#categories for similarity metrics (X dimension),
#and the features we want to model with (such as stars, and review count)
df = df_business[['business_id','latitude','longitude','is_open','categories','stars','review_count','city']]
df = df.set_index('business_id') #Set the index to be the business id

#Visualize the head
df.head()

business_id,latitude,longitude,is_open,categories,stars,review_count,city
1SWheh84yJXfytovILXOAQ,33.522143,-112.018481,0,"Golf, Active Life",3.0,5,Phoenix
QXAEGFB4oINsVuTFxEYKFQ,43.605499,-79.652289,1,"Specialty Food, Restaurants, Dim Sum, Imported...",2.5,128,Mississauga
gnKjwL_1w79qoiV3IC_xQQ,35.092564,-80.859132,1,"Sushi Bars, Restaurants, Japanese",4.0,170,Charlotte
xvX2CttrVhyG2z1dFg_0xw,33.455613,-112.395596,1,"Insurance, Financial Services",5.0,3,Goodyear
HhyxOkGAM07SRYtlQ4wMFQ,35.190012,-80.887223,1,"Plumbing, Shopping, Local Services, Home Servi...",4.0,4,Charlotte


We will limit the dataframe to the "Las Vegas Strip", which is the area of interest for this analysis. The code below keeps only the businesses within that geographical area.

**Visual representation of the "area of interest" of the "Las Vegas Strip"**

<img src="images/las_Vegas.jpg" width="1000" height="1000" align="left"/>

In [84]:
#Define lat and long limits
lat_low = 36.092239
lat_up = 36.159071

long_left = -115.234375
long_right = -115.136185

#Mask the dataframe for such lat and long area
mask = (df['latitude']>=lat_low) & (df['latitude']<=lat_up) & (df['longitude']>=long_left) & (df['longitude']<=long_right)
df = df[mask]

#Display the shape
df.shape

(8527, 7)

As we can see, the number of businesses to analyze in this instance is 8,527. We will now standardize the data so it can be run through the data architecture process and the CNN.

In [85]:
#Standardize data
ssFeat = ['stars','review_count']
ss = StandardScaler()
df_ss = pd.DataFrame(ss.fit_transform(df[ssFeat]),columns = ssFeat)
df_ss = pd.concat([df_ss,pd.DataFrame(df.index)],axis=1).set_index('business_id')

#Merge the data
df = pd.merge(df.drop(columns = ['stars','review_count']),
              df_ss,
              how='inner',
              left_index=True,
              right_index=True)

#Inspect the shape
df.shape



(8527, 7)
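
The standardization above can also be expressed directly in pandas, which keeps the `business_id` index attached and avoids the extra concat and merge. The following is a minimal sketch on toy data (the ids and values are hypothetical, not taken from the dataset); it assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df[['stars', 'review_count']] (hypothetical values)
toy = pd.DataFrame(
    {"stars": [3.0, 4.0, 5.0], "review_count": [5.0, 170.0, 3.0]},
    index=pd.Index(["a", "b", "c"], name="business_id"),
)
cols = ["stars", "review_count"]

# z-score per column: subtract the column mean, divide by the population std
scaled = (toy[cols] - toy[cols].mean()) / toy[cols].std(ddof=0)
print(scaled)
```

Because the result keeps the original index, it can be assigned straight back to the same columns of the source dataframe.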

In [86]:
#Inspect the head
df.head()

business_id,latitude,longitude,is_open,categories,city,stars,review_count
iojTeSaoPuxm4WeCzDUA6w,36.129424,-115.184443,1,"Car Rental, Windshield Installation & Repair, ...",Las Vegas,0.876344,-0.163222
Qwt9lOpplBAZ7JBrgAqI7g,36.138452,-115.198019,1,"Home Services, Real Estate, Apartments",Las Vegas,-1.658798,-0.290213
R3rss9fkfJxiOK6DueON3w,36.123107,-115.170253,1,"Shopping, Women's Clothing, Fashion",Las Vegas,0.369315,-0.290213
kANF0dbeoW34s2vwh6Umfw,36.125031,-115.22562,0,"Fast Food, Food, Restaurants, Ice Cream & Froz...",Las Vegas,-1.658798,-0.214559
gas3YSrKkEcBliUHhnOLTg,36.118157,-115.17643,1,"Accessories, Shopping, Fashion, Jewelry, Leath...",Las Vegas,-0.137713,-0.246982


Now we will pull in the features from the review dataframe and merge them with this dataframe.

In [87]:
#Read the df
df_reviews = pd.read_csv('./csv_data/reviews_df.csv').set_index('business_id')
#Visualize the head
df_reviews.head()

business_id,cool,funny,useful,rev_stars,positive_comments,negative_comments,age,t_last_c,t_comments,polarity,subjectivity
--9e1ONYQuAa-CB_Rrw7Tw,0.025137,0.022688,0.025372,0.781457,4.42191,0.798729,2.149817,-0.652387,2.577041,0.344247,0.590875
--DdmeR16TRb3LsjG0ejrQ,0.082353,0.10274,0.071429,0.583333,-0.279587,-0.295701,-0.035132,1.267161,-0.80034,0.13789,0.506403
--WsruI0IGEoeRmkErU5Gg,0.008824,0.002568,0.011905,0.921875,-0.259057,-0.232561,-1.081193,-0.066477,-1.057948,0.229604,0.543175
--z7PM8AGaJP0aBmGMY7RA,0.007059,0.008219,0.020952,0.95,-0.226794,-0.274654,-0.008271,-0.523299,0.30738,0.233854,0.47647
-0BxAGlIk5DJAGVkpqBXxg,0.014439,0.015878,0.017857,0.494318,-0.200398,-0.064187,1.630943,-0.502538,1.959645,0.199909,0.574302


In [88]:
#Inspect the shape
df_reviews.shape

(8527, 11)

In [89]:
#Merge both dataframes
df= pd.merge(df,
             df_reviews,
             how='inner',
             left_index=True,
             right_index=True)

df['is_closed'] = df['is_open'].apply(lambda x: 1 if x==0 else 0)
#Check the shape
df.shape

(8527, 19)

In [90]:
#Save as CSV
df.to_csv('./csv_data/df_2d_class.csv')
#Inspect the head
df.head()

business_id,latitude,longitude,is_open,categories,city,stars,review_count,cool,funny,useful,rev_stars,positive_comments,negative_comments,age,t_last_c,t_comments,polarity,subjectivity,is_closed
iojTeSaoPuxm4WeCzDUA6w,36.129424,-115.184443,1,"Car Rental, Windshield Installation & Repair, ...",Las Vegas,0.876344,-0.163222,0.05098,0.005327,0.077601,0.842593,-0.147605,-0.232561,-0.127313,-0.64436,0.259535,0.266891,0.575435,0
Qwt9lOpplBAZ7JBrgAqI7g,36.138452,-115.198019,1,"Home Services, Real Estate, Apartments",Las Vegas,-1.658798,-0.290213,0.007059,0.008219,0.02381,0.3,-0.291319,-0.232561,-1.381839,-0.61679,-1.031203,-0.004517,0.446689,0
R3rss9fkfJxiOK6DueON3w,36.123107,-115.170253,1,"Shopping, Women's Clothing, Fashion",Las Vegas,0.369315,-0.290213,0.042353,0.032877,0.02381,0.7,-0.285453,-0.274654,1.280116,0.673612,0.893604,-0.024604,0.580472,0
kANF0dbeoW34s2vwh6Umfw,36.125031,-115.22562,0,"Fast Food, Food, Restaurants, Ice Cream & Froz...",Las Vegas,-1.658798,-0.214559,0.02246,0.024907,0.026696,0.310606,-0.244392,0.02,1.128602,0.248548,0.996227,0.039826,0.530649,1
gas3YSrKkEcBliUHhnOLTg,36.118157,-115.17643,1,"Accessories, Shopping, Fashion, Jewelry, Leath...",Las Vegas,-0.137713,-0.246982,0.011765,0.011742,0.015873,0.666667,-0.238526,-0.274654,-0.839853,-0.532271,-0.531763,0.259513,0.543087,0


### Construction of distance matrix<a id="dist"></a>

In this section the code to create the distance matrix is detailed.

In [91]:
#Create columns that convert the lat and long to UTM in order to compute Euclidean distance
values = [utm.from_latlon(x,y)[0] for (x,y) in zip(df['latitude'],df['longitude'])]
df['UTM1'] = values
values = [utm.from_latlon(x,y)[1] for (x,y) in zip(df['latitude'],df['longitude'])]
df['UTM2'] = values

#Convert the columns to numpy arrays
lats = df['UTM1'].values
lons = df['UTM2'].values

#Calculate the absolute difference between all the elements of the vector
dlats = np.abs(lats[:, None] - lats[None, :])
dlongs = np.abs(lons[:, None] - lons[None, :])

#Compute Euclidean distance and store on a distance matrix where each row is a vector of the distance of a specific business
#to all the other businesses in the dataset
distances = np.sqrt((dlats)**2 + (dlongs)**2)

#Display the distances shape
distances.shape

(8527, 8527)
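
The pairwise computation above relies on numpy broadcasting. The sketch below shows the same idea on three toy points so the shapes are easy to follow; the coordinates are hypothetical values in metres, not taken from the dataset.

```python
import numpy as np

# Toy UTM coordinates in metres (hypothetical values)
easting = np.array([0.0, 3.0, 0.0])
northing = np.array([0.0, 4.0, 10.0])

# x[:, None] has shape (n, 1) and x[None, :] has shape (1, n);
# subtracting them broadcasts to an (n, n) matrix of pairwise differences
d_east = easting[:, None] - easting[None, :]
d_north = northing[:, None] - northing[None, :]

# Euclidean distance between every pair of points
toy_distances = np.hypot(d_east, d_north)
print(toy_distances)
```

Row `i` of the result holds the distance from point `i` to every other point, the matrix is symmetric, and its diagonal is zero.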

In [92]:
#Create a sign matrix to update the distance matrix based on north to south orientation
#utm.from_latlon returns (easting, northing, ...), so the northing is stored in UTM2 (the `lons` array)
signs = np.sign(lons[:, None] - lons[None, :])

distances = np.multiply(distances,signs)

In [93]:
#Check again the shape
distances.shape

(8527, 8527)
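
To illustrate what the signed matrix encodes, here is a small sketch on toy coordinates (hypothetical values). It assumes the sign is taken from the difference in northing, so that an entry is positive when the row business lies north of the column business and negative when it lies south.

```python
import numpy as np

# Toy UTM coordinates in metres (hypothetical values)
easting = np.array([0.0, 3.0, 0.0])
northing = np.array([0.0, 4.0, 10.0])

unsigned = np.hypot(easting[:, None] - easting[None, :],
                    northing[:, None] - northing[None, :])

# +1 where the row point is north of the column point, -1 where it is south,
# and 0 where both share exactly the same northing
signs = np.sign(northing[:, None] - northing[None, :])
signed = unsigned * signs
print(signed)
```

Two consequences are worth keeping in mind: the signed matrix is antisymmetric (`signed[i, j] == -signed[j, i]`), and any pair with exactly the same northing gets a sign of 0, so its distance collapses to 0.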

### Construction of category matrix<a id="cat"></a>

In this section the code to create the category matrix is detailed. It indicates whether a given business belongs to a specific category, and this output is then used to create a category mask with the same dimensions as the distance matrix.

In [94]:
#Expand the Categories column into a dummy field
#First obtain a list of all the categories that exist
categories = []
for x in df['categories']:
    if type(x)==float:
        continue
    else:
        categories.extend(x.split(','))

#Clean white spaces
categories = [x.strip() for x in categories]
#Create a list with unique items
categories = [str(x) for x in set(categories)]

#Create a dataframe
frame = []
for x in df['categories']:
    #Missing categories are read as NaN (a float), so they get an all-zero row
    if isinstance(x, str):
        present = {c.strip() for c in x.split(',')}
    else:
        present = set()
    frame.append([1 if cat in present else 0 for cat in categories])

#Store in a DataFrame
df_categories = pd.DataFrame(frame, columns = categories).fillna(0)

#Add the business id column
df_categories.index=df.index
df_categories = df_categories.astype(float)

In [148]:
#Check the head of one category vector
df_categories['Restaurants'].head()

business_id
iojTeSaoPuxm4WeCzDUA6w    0.0
Qwt9lOpplBAZ7JBrgAqI7g    0.0
R3rss9fkfJxiOK6DueON3w    0.0
kANF0dbeoW34s2vwh6Umfw    1.0
gas3YSrKkEcBliUHhnOLTg    0.0
Name: Restaurants, dtype: float64
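
The category matrix is a multi-hot encoding: one row per business, one column per category, with 1 where the business lists that category. The sketch below reproduces the idea on three toy rows; the helper name `multi_hot` and the sample strings are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def multi_hot(category_strings):
    """Return (sorted vocabulary, 0/1 matrix) for comma-separated category strings."""
    # Non-string entries (e.g. NaN for missing categories) become empty lists
    split = [[c.strip() for c in s.split(",")] if isinstance(s, str) else []
             for s in category_strings]
    vocab = sorted({c for row in split for c in row if c})
    column = {c: j for j, c in enumerate(vocab)}
    matrix = np.zeros((len(split), len(vocab)))
    for i, row in enumerate(split):
        for c in row:
            if c:
                matrix[i, column[c]] = 1.0
    return vocab, matrix

vocab, matrix = multi_hot(["Golf, Active Life",
                           "Sushi Bars, Restaurants",
                           float("nan")])
print(vocab)
print(matrix)
```

A business with no categories ends up as an all-zero row, matching the handling of missing values in the cell above.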

In [150]:
df_categories.columns

Index(['Language Schools', 'Toy Stores', 'Colonics', 'Advertising',
       'Decks & Railing', 'Karate', 'Cultural Center', 'Pool Halls', 'Trusts',
       'Video\/Film Production',
       ...
       'Gun\/Rifle Ranges', 'Sporting Goods', 'Henna Artists', 'Tours',
       'Photographers', 'Vocational & Technical School', 'Battery Stores',
       'other_low', 'other_up', 'top'],
      dtype='object', length=953)

In the following lines of code we will loop through the different distance and category bins to build the data architecture.

In [96]:
#In order to define the distance bins, we need to look at the distribution of the distance.
np.min(distances)

-11109.495594971773

In [97]:
np.max(distances)

11109.495594971773

The north-south distances span roughly ±11 km, so we will define bins that do not exceed these limits. We also need a measure of similarity between categories, in order to arrange them along the category axis from the middle outward by similarity. For this task we use the Word2Vec implementation in gensim, trained here on the category lists of the businesses in the dataset (the `corpus` built in the next cell).

In [98]:
def strip_split(x):
    result = []
    for word in x.split(','):
        word = word.strip(' ')
        result.append(word)
    return result

df['categories'] = df['categories'].fillna('')
corpus = list(df['categories'].apply(lambda x: strip_split(x)).values)

In [99]:
# Start timer.
t0 = time.time()

# Train word vectors on the category corpus and store them in "model."
model = Word2Vec(corpus,      # Corpus of data.
                 size=100,    # Dimensions
                 window=5,    # Context words
                 min_count=1, # Ignores words below this threshold.
                 sg=0,        # SG = 0 uses CBOW (default).
                 workers=4)   # (parallelizes process).

# Print results of timer.
print(time.time() - t0)

14.896151065826416


In [117]:
#Display top 15 similarities (queried through model.wv, since calling most_similar on the model itself is deprecated)
similarity = model.wv.most_similar('Restaurants',topn = 15)
similarity

  


[('American (New)', 0.9935117959976196),
 ('American (Traditional)', 0.9906449317932129),
 ('Burgers', 0.9882069826126099),
 ('Food', 0.9874913692474365),
 ('Sports Bars', 0.9850655794143677),
 ('Bars', 0.9816811680793762),
 ('Breakfast & Brunch', 0.9816216230392456),
 ('Steakhouses', 0.9791653156280518),
 ('Italian', 0.9789568185806274),
 ('Pizza', 0.9783942699432373),
 ('Sandwiches', 0.9778966903686523),
 ('Nightlife', 0.9777824878692627),
 ('Cocktail Bars', 0.9770600199699402),
 ('Cafes', 0.9761971831321716),
 ('Wine & Spirits', 0.9735256433486938)]
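
The scores returned by `most_similar` are cosine similarities between word vectors. The sketch below shows how such a ranking arises, using three hand-made 3-dimensional vectors; these are hypothetical stand-ins, not vectors from the trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings, chosen only to illustrate the ranking
vectors = {
    "Restaurants": np.array([1.0, 1.0, 0.0]),
    "Burgers": np.array([2.0, 2.0, 0.1]),
    "Golf": np.array([0.0, 0.1, 1.0]),
}

query = "Restaurants"
ranked = sorted(
    ((name, cosine_similarity(vectors[query], vec))
     for name, vec in vectors.items() if name != query),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Vectors pointing in nearly the same direction score close to 1 regardless of their length, which is why "Burgers" ranks above "Golf" in this toy example.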

In [163]:
#Store top similarities on a list
top_c = [x for x,y in similarity]
top_c

['American (New)',
 'American (Traditional)',
 'Burgers',
 'Food',
 'Sports Bars',
 'Bars',
 'Breakfast & Brunch',
 'Steakhouses',
 'Italian',
 'Pizza',
 'Sandwiches',
 'Nightlife',
 'Cocktail Bars',
 'Cafes',
 'Wine & Spirits']

In [176]:
#Store other categories in a list
categories = list(model.wv.vocab.keys())
low_c = []
for x in categories:
    if x in top_c:
        pass
    else:
        low_c.append(x)

#Shuffle list
random.shuffle(low_c)
random.shuffle(low_c)
random.shuffle(low_c)

#Remove the main category ('Restaurants') and the empty string
low_c.remove('Restaurants')
low_c.remove('')

#Split into two groups
top_c1 = top_c[:int(len(top_c)/2)]
top_c2 = top_c[int(len(top_c)/2):]

#Split into two groups
low_c1 = low_c[:int(len(low_c)/2)]
low_c2 = low_c[int(len(low_c)/2):]

In [177]:
low_c1 + low_c2 == low_c

True

In [178]:
#Create the category boolean for "other", low and up
df_categories['other_low'] = df_categories[low_c1].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)
df_categories['other_up'] = df_categories[low_c2].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)

#Create categories for top
df_categories['top'] = df_categories[top_c].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)

#Set the "other" flags to zero when the business is already captured in "top", so the three groups do not overlap
df_categories['other_low'] = np.select([(df_categories['top'] == 1) & (df_categories['other_low'] == 1),
                                        (df_categories['top'] == 1) & (df_categories['other_low'] == 0),
                                        (df_categories['top'] == 0) & (df_categories['other_low'] == 1),
                                        (df_categories['top'] == 0) & (df_categories['other_low'] == 0)],
                                       [0,0,1,0],
                                       default = 0)

df_categories['other_up'] = np.select([(df_categories['top'] == 1) & (df_categories['other_up'] == 1),
                                        (df_categories['top'] == 1) & (df_categories['other_up'] == 0),
                                        (df_categories['top'] == 0) & (df_categories['other_up'] == 1),
                                        (df_categories['top'] == 0) & (df_categories['other_up'] == 0)],
                                       [0,0,1,0],
                                       default = 0)

df_categories['other_up'] = np.select([(df_categories['other_low'] == 1) & (df_categories['other_up'] == 1),
                                       (df_categories['other_low'] == 1) & (df_categories['other_up'] == 0),
                                       (df_categories['other_low'] == 0) & (df_categories['other_up'] == 1),
                                       (df_categories['other_low'] == 0) & (df_categories['other_up'] == 0)],
                                      [0,0,1,0],
                                      default = 0)

#Convert everything to float
df_categories = df_categories.astype(float)

In [179]:
#Display the list of bins to include in the model
['other_low'] + top_c1 + ['Restaurants'] + top_c2 +['other_up']

['other_low',
 'American (New)',
 'American (Traditional)',
 'Burgers',
 'Food',
 'Sports Bars',
 'Bars',
 'Breakfast & Brunch',
 'Restaurants',
 'Steakhouses',
 'Italian',
 'Pizza',
 'Sandwiches',
 'Nightlife',
 'Cocktail Bars',
 'Cafes',
 'Wine & Spirits',
 'other_up']

In [180]:
#Visualize the other low and other up vectors (there should be no overlap)
df_categories[['top','other_low','other_up']].tail(20)

business_id,top,other_low,other_up
HvJwmfRW2JYwalB5DCeH4A,0.0,1.0,0.0
FxhPnGuHqLHkBX9p1jm06Q,1.0,0.0,0.0
rXFMVZtJzuCG6bBlmsRjuQ,0.0,1.0,0.0
TLC594vCsmpdDkT7iu2MeQ,0.0,1.0,0.0
YA9nDSe7_9j9HrQ-qXrjAA,0.0,0.0,1.0
F6ScBoyzVPhuvyJUUdUZ9Q,0.0,1.0,0.0
nySa840axVEJUcAjg_sTCQ,0.0,1.0,0.0
3GfdCuI0YCc5U3rLLLPHUw,1.0,0.0,0.0
vDIZQffxHxbli2u4DsZ4uQ,1.0,0.0,0.0
F2ZWxWi_Ci8nyRiXA0WS4Q,1.0,0.0,0.0


In [183]:
df_categories['other_up'].value_counts()

0.0    7611
1.0     916
Name: other_up, dtype: int64

In [184]:
df_categories['other_low'].value_counts()

1.0    4774
0.0    3753
Name: other_low, dtype: int64
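
The three `np.select` calls above enforce a priority order: `top` first, then `other_low`, then `other_up`. The same rule can be written with boolean operations, as in this minimal sketch on toy flag vectors (hypothetical values).

```python
import numpy as np

# Toy membership flags for five businesses (hypothetical values)
top = np.array([1, 1, 0, 0, 0]).astype(bool)
other_low = np.array([1, 0, 1, 1, 0]).astype(bool)
other_up = np.array([1, 1, 1, 0, 1]).astype(bool)

# A business already in "top" is dropped from both "other" groups,
# and a business kept in "other_low" is dropped from "other_up"
low_exclusive = other_low & ~top
up_exclusive = other_up & ~top & ~low_exclusive

# Each business now belongs to at most one of the three groups
membership = top.astype(int) + low_exclusive.astype(int) + up_exclusive.astype(int)
print(low_exclusive, up_exclusive, membership)
```

Since the groups no longer overlap, each business contributes to only one of these three bins along the category axis.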

### Script of the data architecture<a id="da"></a>

In this section the code to create the data architecture for the CNN is detailed. The steps it follows are:

   - **(1) Define bins:** Define category and distance bins as lists to compose the "x" and "y" axes of each business's grid. 
   - **(2) Select features:** Select the features to include in the "feature array". 
   - **(3) Loop through bins to create a numpy array with the desired architecture:** Loop through the criteria defined in steps (1) and (2) and store a numpy array with the desired 4D architecture.
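
To make the steps above concrete, the sketch below computes a single cell of the grid (one category, one distance bin, one feature) for three toy businesses. All values are hypothetical. Note one deliberate difference from the loop in this notebook: the sketch counts neighbours explicitly instead of treating zeros as missing values, so the two can disagree when a feature value is exactly 0.

```python
import numpy as np

# Signed pairwise distances in metres for three businesses (hypothetical values)
signed_dist = np.array([[0.0, -30.0, -80.0],
                        [30.0, 0.0, -40.0],
                        [80.0, 40.0, 0.0]])
in_category = np.array([1.0, 1.0, 0.0])  # business 2 is not in the category
feature = np.array([1.0, 3.0, 5.0])      # e.g. a standardized rating
lo, hi = 0, 50                           # one distance bin: [0, 50)

# Neighbour j counts for business i if it is in the bin AND in the category
mask = ((signed_dist >= lo) & (signed_dist < hi)) & in_category.astype(bool)[None, :]

counts = mask.sum(axis=1)
sums = (mask * feature[None, :]).sum(axis=1)
means = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)

# Cell value: neighbourhood mean minus the business's own value (0 if no neighbours)
cell = np.where(counts > 0, means - feature, 0.0)
print(cell)
```

Because the diagonal of the distance matrix is 0, a business that belongs to the category falls into its own `[0, 50)` bin, as it does in the notebook's loop.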

In [186]:
#Define categories to loop on
categories = ['other_low'] + top_c1 + ['Restaurants'] + top_c2 + ['other_up']

#Define distances to loop on
dist_lst = [
    [5000,int(np.max(distances))],
    [2500,5000],
    [1000,2500],
    [500,1000],
    [250,500],
    [100,250],
    [50,100],
    [0,50],
    [-50,0],
    [-100,-50],
    [-250,-100],
    [-500,-250],
    [-1000,-500],
    [-2500,-1000],
    [-5000,-2500],
    [int(np.min(distances)),-5000]
]

#Define features to loop on
features = ['stars','rev_stars','cool','funny','useful','positive_comments','negative_comments','age','polarity','subjectivity'] 

#Define empty arrays
All = np.empty((distances.shape[0],0)) #For inner loop on cat and distance

#Extract category boolean vector for each category
for cat in categories:
    category = np.array(df_categories[cat]) #Create numpy array
    category = np.tile(category,(distances.shape[0],1)) #Reshape to match the distances matrix
    
    #Loop through distance bins to create distance and category mask
    for dist in dist_lst:
        distance_mask = (distances >= dist[0]) & (distances < dist[1])
        mask = np.multiply(distance_mask,category)
        
        #Loop through features to apply such calculation to the mask
        for feature in features:
            stars = np.tile(df[feature].values,(distances.shape[0],1))
            values = np.multiply(mask,stars)
            values[values==0]=np.nan
            means = np.nan_to_num(np.nanmean(values[:,:],axis=1))
            means = np.where(means!=0,means - df[feature].values,0)
            means = means.reshape(means.shape[0],1)
            
            #Assign to values unique array
            All = np.concatenate((All, means),axis=1)
  
#Reshape all the array into the right format for the CNN
All = All.reshape((distances.shape[0],
                   len(categories),
                   len(dist_lst),
                   len(features)))



In [187]:
#Let's inspect the shape
All.shape

(8527, 18, 16, 10)
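
The final reshape works because the columns were appended in category, distance, feature order, which matches numpy's row-major layout. The sketch below checks this on a tiny example and shows a possible alternative that preallocates the 4D array instead of concatenating one column at a time. The function `cell` is a hypothetical stand-in for the per-cell computation.

```python
import numpy as np

n, n_cat, n_dist, n_feat = 4, 3, 2, 5

def cell(c, d, f):
    # Hypothetical stand-in: a column whose value encodes its (c, d, f) position
    return np.full(n, 100 * c + 10 * d + f, dtype=float)

# Approach used above: append one column at a time, then reshape
columns = np.empty((n, 0))
for c in range(n_cat):
    for d in range(n_dist):
        for f in range(n_feat):
            columns = np.concatenate((columns, cell(c, d, f).reshape(n, 1)), axis=1)
reshaped = columns.reshape((n, n_cat, n_dist, n_feat))

# Alternative: allocate the tensor once and write each cell in place
tensor = np.zeros((n, n_cat, n_dist, n_feat))
for c in range(n_cat):
    for d in range(n_dist):
        for f in range(n_feat):
            tensor[:, c, d, f] = cell(c, d, f)

print(np.array_equal(reshaped, tensor))
```

Both approaches produce the same array; preallocating avoids copying the growing array on every `np.concatenate` call.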

In [188]:
#Save the dataset
np.save('./numpy_arrays/cnn_dataset_rest',All)

#Save the target variable
np.save('./numpy_arrays/target_rest',df['is_closed'].values)

#Save the observation's id
np.save('./numpy_arrays/ids_rest',np.array(df.index))

#Save the Main category id (Restaurants) - used for masking this specific business category in the modelling process
np.save('./numpy_arrays/rest_category',df_categories['Restaurants'].values)
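
For reference, here is a minimal sketch of how arrays saved this way can be read back. It writes to a temporary directory with hypothetical file names. It assumes the ids are stored as an object array, as is typical for a pandas index of strings, in which case `np.load` needs `allow_pickle=True`.

```python
import os
import tempfile

import numpy as np

toy_tensor = np.arange(24, dtype=float).reshape(2, 3, 2, 2)
toy_ids = np.array(["id_a", "id_b"], dtype=object)  # object array, like a string index

with tempfile.TemporaryDirectory() as tmp:
    tensor_path = os.path.join(tmp, "cnn_dataset_toy")
    ids_path = os.path.join(tmp, "ids_toy")

    # np.save appends the ".npy" extension when the name does not already have it
    np.save(tensor_path, toy_tensor)
    np.save(ids_path, toy_ids)
    saved_files = sorted(os.listdir(tmp))

    restored_tensor = np.load(tensor_path + ".npy")
    # Object arrays are pickled on save, so loading them must opt in explicitly
    restored_ids = np.load(ids_path + ".npy", allow_pickle=True)

print(saved_files)
```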

**The script above loops over restaurants as the main category of interest.**
The scripts that follow loop through other categories of interest such as "Food", "Bars", and "Cafes".
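
Since the scripts for each main category differ only in the category name, the construction of the category axis could be factored into a helper. The sketch below is one possible shape for it. The function name `build_category_axis` and the `seed` argument are assumptions introduced here; the notebook itself shuffles without a seed, so its `other_low` and `other_up` groups change from run to run.

```python
import random

def build_category_axis(main, similar, vocabulary, seed=0):
    """Return (axis labels, low group, up group) for one main category.

    `similar` is the list of most similar categories, ordered by similarity;
    `vocabulary` is the full list of category names.
    """
    excluded = set(similar) | {main, ""}
    others = [c for c in vocabulary if c not in excluded]
    # Seeded shuffle (an assumption) so the split is reproducible
    random.Random(seed).shuffle(others)

    half = len(similar) // 2
    mid = len(others) // 2
    axis = (["other_low"] + list(similar[:half]) + [main]
            + list(similar[half:]) + ["other_up"])
    return axis, others[:mid], others[mid:]

axis, low_group, up_group = build_category_axis(
    "Food",
    ["Restaurants", "Bars", "Cafes"],
    ["Food", "Restaurants", "Bars", "Cafes", "Golf", "Tours", "", "Karate"],
)
print(axis)
```

The returned `low_group` and `up_group` play the role of `low_c1` and `low_c2` in the cells that follow.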

In [189]:
#Create category similarity
similarity = model.wv.most_similar('Food',topn = 15)
top_c = [x for x,y in similarity]
#Store other categories in a list
categories = list(model.wv.vocab.keys())
low_c = []
for x in categories:
    if x in top_c:
        pass
    else:
        low_c.append(x)

#Shuffle list
random.shuffle(low_c)
random.shuffle(low_c)
random.shuffle(low_c)

#Remove the main category ('Food') and the empty string
low_c.remove('Food')
low_c.remove('')

#Split into two groups
top_c1 = top_c[:int(len(top_c)/2)]
top_c2 = top_c[int(len(top_c)/2):]

#Split into two groups
low_c1 = low_c[:int(len(low_c)/2)]
low_c2 = low_c[int(len(low_c)/2):]


#Create the category boolean for "other", low and up
df_categories['other_low'] = df_categories[low_c1].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)
df_categories['other_up'] = df_categories[low_c2].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)

#Create categories for top
df_categories['top'] = df_categories[top_c].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)

#Set the "other" flags to zero when the business is already captured in "top", so the three groups do not overlap
df_categories['other_low'] = np.select([(df_categories['top'] == 1) & (df_categories['other_low'] == 1),
                                        (df_categories['top'] == 1) & (df_categories['other_low'] == 0),
                                        (df_categories['top'] == 0) & (df_categories['other_low'] == 1),
                                        (df_categories['top'] == 0) & (df_categories['other_low'] == 0)],
                                       [0,0,1,0],
                                       default = 0)

df_categories['other_up'] = np.select([(df_categories['top'] == 1) & (df_categories['other_up'] == 1),
                                        (df_categories['top'] == 1) & (df_categories['other_up'] == 0),
                                        (df_categories['top'] == 0) & (df_categories['other_up'] == 1),
                                        (df_categories['top'] == 0) & (df_categories['other_up'] == 0)],
                                       [0,0,1,0],
                                       default = 0)

df_categories['other_up'] = np.select([(df_categories['other_low'] == 1) & (df_categories['other_up'] == 1),
                                       (df_categories['other_low'] == 1) & (df_categories['other_up'] == 0),
                                       (df_categories['other_low'] == 0) & (df_categories['other_up'] == 1),
                                       (df_categories['other_low'] == 0) & (df_categories['other_up'] == 0)],
                                      [0,0,1,0],
                                      default = 0)

#Convert everything to float
df_categories = df_categories.astype(float)

#Define categories to loop on
categories = ['other_low'] + top_c1 + ['Food'] + top_c2 + ['other_up']

#Define distances to loop on
dist_lst = [
    [5000,int(np.max(distances))],
    [2500,5000],
    [1000,2500],
    [500,1000],
    [250,500],
    [100,250],
    [50,100],
    [0,50],
    [-50,0],
    [-100,-50],
    [-250,-100],
    [-500,-250],
    [-1000,-500],
    [-2500,-1000],
    [-5000,-2500],
    [int(np.min(distances)),-5000]
]

#Define features to loop on
features = ['stars','rev_stars','cool','funny','useful','positive_comments','negative_comments','age','polarity','subjectivity']

#Define empty arrays
All = np.empty((distances.shape[0],0)) #For inner loop on cat and distance

#Extract category boolean vector for each category
for cat in categories:
    category = np.array(df_categories[cat]) #Create numpy array
    category = np.tile(category,(distances.shape[0],1)) #Reshape to match the distances matrix
    
    #Loop through distance bins to create distance and category mask
    for dist in dist_lst:
        distance_mask = (distances >= dist[0]) & (distances < dist[1])
        mask = np.multiply(distance_mask,category)
        
        #Loop through features to apply such calculation to the mask
        for feature in features:
            stars = np.tile(df[feature].values,(distances.shape[0],1))
            values = np.multiply(mask,stars)
            values[values==0]=np.nan
            means = np.nan_to_num(np.nanmean(values[:,:],axis=1))
            means = np.where(means!=0,means - df[feature].values,0)
            means = means.reshape(means.shape[0],1)
            
            #Assign to values unique array
            All = np.concatenate((All, means),axis=1)
  
#Reshape all the array into the right format for the CNN
All = All.reshape((distances.shape[0],
                   len(categories),
                   len(dist_lst),
                   len(features)))

#Save the dataset
np.save('./numpy_arrays/cnn_dataset_food',All)

#Save the target variable
np.save('./numpy_arrays/target_food',df['is_closed'].values)

#Save the observation's id
np.save('./numpy_arrays/ids_food',np.array(df.index))

#Save the Main category id (Food)
np.save('./numpy_arrays/food_category',df_categories['Food'].values)

  


Adding Bars ...

In [190]:
#Create category similarity
similarity = model.wv.most_similar('Bars',topn = 15)
top_c = [x for x,y in similarity]
#Store other categories in a list
categories = list(model.wv.vocab.keys())
low_c = []
for x in categories:
    if x in top_c:
        pass
    else:
        low_c.append(x)

#Shuffle list
random.shuffle(low_c)
random.shuffle(low_c)
random.shuffle(low_c)

#Remove the main category ('Bars') and the empty string
low_c.remove('Bars')
low_c.remove('')

#Split into two groups
top_c1 = top_c[:int(len(top_c)/2)]
top_c2 = top_c[int(len(top_c)/2):]

#Split into two groups
low_c1 = low_c[:int(len(low_c)/2)]
low_c2 = low_c[int(len(low_c)/2):]

#Create the category boolean for "other", low and up
df_categories['other_low'] = df_categories[low_c1].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)
df_categories['other_up'] = df_categories[low_c2].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)

#Create categories for top
df_categories['top'] = df_categories[top_c].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)

#Set the "other" flags to zero when the business is already captured in "top", so the three groups do not overlap
df_categories['other_low'] = np.select([(df_categories['top'] == 1) & (df_categories['other_low'] == 1),
                                        (df_categories['top'] == 1) & (df_categories['other_low'] == 0),
                                        (df_categories['top'] == 0) & (df_categories['other_low'] == 1),
                                        (df_categories['top'] == 0) & (df_categories['other_low'] == 0)],
                                       [0,0,1,0],
                                       default = 0)

df_categories['other_up'] = np.select([(df_categories['top'] == 1) & (df_categories['other_up'] == 1),
                                        (df_categories['top'] == 1) & (df_categories['other_up'] == 0),
                                        (df_categories['top'] == 0) & (df_categories['other_up'] == 1),
                                        (df_categories['top'] == 0) & (df_categories['other_up'] == 0)],
                                       [0,0,1,0],
                                       default = 0)

df_categories['other_up'] = np.select([(df_categories['other_low'] == 1) & (df_categories['other_up'] == 1),
                                       (df_categories['other_low'] == 1) & (df_categories['other_up'] == 0),
                                       (df_categories['other_low'] == 0) & (df_categories['other_up'] == 1),
                                       (df_categories['other_low'] == 0) & (df_categories['other_up'] == 0)],
                                      [0,0,1,0],
                                      default = 0)

#Convert everything to float
df_categories = df_categories.astype(float)

#Define categories to loop on
categories = ['other_low'] + top_c1 + ['Bars'] + top_c2 + ['other_up']

#Define distances to loop on
dist_lst = [
    [5000,int(np.max(distances))],
    [2500,5000],
    [1000,2500],
    [500,1000],
    [250,500],
    [100,250],
    [50,100],
    [0,50],
    [-50,0],
    [-100,-50],
    [-250,-100],
    [-500,-250],
    [-1000,-500],
    [-2500,-1000],
    [-5000,-2500],
    [int(np.min(distances)),-5000]
]

#Define features to loop on
features = ['stars','rev_stars','cool','funny','useful','positive_comments','negative_comments','age','polarity','subjectivity'] 

#Define empty arrays
All = np.empty((distances.shape[0],0)) #For inner loop on cat and distance

#Extract category boolean vector for each category
for cat in categories:
    category = np.array(df_categories[cat]) #Create numpy array
    category = np.tile(category,(distances.shape[0],1)) #Reshape to match the distances matrix
    
    #Loop through distance bins to create distance and category mask
    for dist in dist_lst:
        distance_mask = (distances >= dist[0]) & (distances < dist[1])
        mask = np.multiply(distance_mask,category)
        
        #Loop through features to apply such calculation to the mask
        for feature in features:
            stars = np.tile(df[feature].values,(distances.shape[0],1))
            values = np.multiply(mask,stars)
            values[values==0]=np.nan
            means = np.nan_to_num(np.nanmean(values[:,:],axis=1))
            means = np.where(means!=0,means - df[feature].values,0)
            means = means.reshape(means.shape[0],1)
            
            #Assign to values unique array
            All = np.concatenate((All, means),axis=1)
  
#Reshape all the array into the right format for the CNN
All = All.reshape((distances.shape[0],
                   len(categories),
                   len(dist_lst),
                   len(features)))

#Save the dataset
np.save('./numpy_arrays/cnn_dataset_bars',All)

#Save the target variable
np.save('./numpy_arrays/target_bars',df['is_closed'].values)

#Save the observation's id
np.save('./numpy_arrays/ids_bars',np.array(df.index))

#Save the main category id (Bars)
np.save('./numpy_arrays/bars_category',df_categories['Bars'].values)
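
The nested loop above can be hard to follow on the full dataset. The following is a minimal sketch, on hypothetical toy data, of the same idea for a single category, distance bin and feature: mask the neighbours, average the feature over them, and subtract each business's own value. The function name `neighbour_mean_diff` and the toy arrays are assumptions made for illustration. Unlike the notebook code, the sketch counts neighbours explicitly instead of replacing zeros with NaN, so neighbours with a true feature value of 0 still contribute to the mean.

```python
import numpy as np

# Hypothetical toy data: 3 businesses, signed distances from row i to column j
distances = np.array([
    [0.0, 30.0, 200.0],
    [-30.0, 0.0, 120.0],
    [-200.0, -120.0, 0.0],
])
category = np.array([1.0, 1.0, 0.0])  # 1 if the business belongs to the category
stars = np.array([4.0, 2.0, 5.0])     # one feature value per business


def neighbour_mean_diff(distances, category, feature, low, high):
    """Mean of `feature` over neighbours within [low, high) that belong to
    `category`, minus each business's own value (0 where no neighbour matches)."""
    # Combine the distance-bin mask with the category membership of each column
    mask = (distances >= low) & (distances < high) & (category[np.newaxis, :] > 0)
    counts = mask.sum(axis=1)
    sums = (mask * feature[np.newaxis, :]).sum(axis=1)
    # Divide only where at least one neighbour matched, leaving 0 elsewhere
    means = np.divide(sums, counts, out=np.zeros_like(sums, dtype=float), where=counts > 0)
    return np.where(counts > 0, means - feature, 0.0)


result = neighbour_mean_diff(distances, category, stars, 0, 50)
print(result)
```

In this toy case only the first business has another matching neighbour in the `[0, 50)` bin, so it is the only one with a non-zero difference.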

  


Adding Cafes ...

In [191]:
#Create category similarity
similarity = model.wv.most_similar('Cafes',topn = 15)
top_c = [x for x,y in similarity]
#Store the other categories in a list
categories = list(model.wv.vocab.keys())
low_c = [x for x in categories if x not in top_c]

#Shuffle list (a single shuffle already yields a uniformly random order)
random.shuffle(low_c)

#Remove the main category (Cafes) and the empty-string category
low_c.remove('Cafes')
low_c.remove('')

#Split the top categories into two groups
top_c1 = top_c[:len(top_c)//2]
top_c2 = top_c[len(top_c)//2:]

#Split the remaining categories into two groups
low_c1 = low_c[:len(low_c)//2]
low_c2 = low_c[len(low_c)//2:]

#Create the category boolean for "other", low and up
df_categories['other_low'] = df_categories[low_c1].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)
df_categories['other_up'] = df_categories[low_c2].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)

#Create categories for top
df_categories['top'] = df_categories[top_c].fillna(0).sum(axis=1).apply(lambda x: 1 if x>0 else 0)

#Set the "other" flags to zero when the business is already captured in "top"
df_categories['other_low'] = np.select([(df_categories['top'] == 1) & (df_categories['other_low'] == 1),
                                        (df_categories['top'] == 1) & (df_categories['other_low'] == 0),
                                        (df_categories['top'] == 0) & (df_categories['other_low'] == 1),
                                        (df_categories['top'] == 0) & (df_categories['other_low'] == 0)],
                                       [0,0,1,0],
                                       default = 0)

df_categories['other_up'] = np.select([(df_categories['top'] == 1) & (df_categories['other_up'] == 1),
                                        (df_categories['top'] == 1) & (df_categories['other_up'] == 0),
                                        (df_categories['top'] == 0) & (df_categories['other_up'] == 1),
                                        (df_categories['top'] == 0) & (df_categories['other_up'] == 0)],
                                       [0,0,1,0],
                                       default = 0)

#Set "other_up" to zero when the business is already captured in "other_low"
df_categories['other_up'] = np.select([(df_categories['other_low'] == 1) & (df_categories['other_up'] == 1),
                                       (df_categories['other_low'] == 1) & (df_categories['other_up'] == 0),
                                       (df_categories['other_low'] == 0) & (df_categories['other_up'] == 1),
                                       (df_categories['other_low'] == 0) & (df_categories['other_up'] == 0)],
                                      [0,0,1,0],
                                      default = 0)

#Convert everything to float
df_categories = df_categories.astype(float)

#Define categories to loop on
categories = ['other_low'] + top_c1 + ['Cafes'] + top_c2 + ['other_up']

#Define distances to loop on
dist_lst = [
    [5000,int(np.max(distances))],
    [2500,5000],
    [1000,2500],
    [500,1000],
    [250,500],
    [100,250],
    [50,100],
    [0,50],
    [-50,0],
    [-100,-50],
    [-250,-100],
    [-500,-250],
    [-1000,-500],
    [-2500,-1000],
    [-5000,-2500],
    [int(np.min(distances)),-5000]
]

#Define features to loop on
features = ['stars','rev_stars','cool','funny','useful','positive_comments','negative_comments','age','polarity','subjectivity'] 

#Define empty arrays
All = np.empty((distances.shape[0],0)) #For inner loop on cat and distance

#Extract category boolean vector for each category
for cat in categories:
    category = np.array(df_categories[cat]) #Create numpy array
    category = np.tile(category,(distances.shape[0],1)) #Tile to the shape of the distances matrix
    
    #Loop through distance bins to create distance and category mask
    for dist in dist_lst:
        distance_mask = (distances >= dist[0]) & (distances < dist[1])
        mask = np.multiply(distance_mask,category)
        
        #Loop through features to compute the masked neighbour mean for each one
        for feature in features:
            feature_values = np.tile(df[feature].values,(distances.shape[0],1))
            values = np.multiply(mask,feature_values)
            values[values==0]=np.nan #Masked-out entries (and true zeros) are ignored by nanmean
            means = np.nan_to_num(np.nanmean(values,axis=1)) #Rows with no matching neighbours become 0
            means = np.where(means!=0,means - df[feature].values,0) #Difference to the business's own value
            means = means.reshape(means.shape[0],1)
            
            #Append the result as a new column
            All = np.concatenate((All, means),axis=1)
  
#Reshape all the array into the right format for the CNN
All = All.reshape((distances.shape[0],
                   len(categories),
                   len(dist_lst),
                   len(features)))

#Save the dataset
np.save('./numpy_arrays/cnn_dataset_cafes',All)

#Save the target variable
np.save('./numpy_arrays/target_cafes',df['is_closed'].values)

#Save the observation's id
np.save('./numpy_arrays/ids_cafes',np.array(df.index))

#Save the main category id (Cafes)
np.save('./numpy_arrays/cafes_category',df_categories['Cafes'].values)
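
The final reshape relies on the columns having been appended in the same order as the loop nesting (category, then distance bin, then feature). A small sketch with made-up dimensions, encoding the loop position into each value, suggests how this ordering can be checked. The sizes and variable names below are hypothetical and not taken from the dataset.

```python
import numpy as np

# Hypothetical sizes: observations, categories, distance bins, features
n_obs, n_cat, n_dist, n_feat = 2, 3, 4, 5

flat = np.empty((n_obs, 0))
for c in range(n_cat):
    for d in range(n_dist):
        for f in range(n_feat):
            # Encode the loop position in the value so it can be verified later
            column = np.full((n_obs, 1), 100 * c + 10 * d + f, dtype=float)
            flat = np.concatenate((flat, column), axis=1)

# Same reshape pattern as in the notebook: (observations, categories, distances, features)
tensor = flat.reshape((n_obs, n_cat, n_dist, n_feat))

# Each entry should decode back to its own (category, distance, feature) index
print(tensor.shape, tensor[0, 2, 3, 4])
```

If the loop nesting and the reshape order ever diverge (for example, looping over features before distance bins), the decoded indices would no longer match and the CNN would receive mislabelled axes.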

  
