<a href="https://colab.research.google.com/github/DimitriosTagkoulis/Clustering-Stock-Movements/blob/master/DataPrep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Prep

## Importing

In [1]:
#Connect to the Google Driver
from google.colab import drive
# Extras
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import gzip
import pandas as pd 

# Unzip zipped file
with gzip.open('/content/drive/MyDrive/Colab_projects/Data_Mining/Project/Raw_Data/listings.csv.gz', 'rb') as listings:
 # Pass the unziped file to pandas
   dfRaw = pd.read_csv(listings)

Used [MyGeodata](https://mygeodata.cloud/) to convert the kml downloaded from [Google Maps](https://www.google.com/maps/d/u/0/viewer?ie=UTF8&oe=UTF8&dg=feature&msa=0&mid=1Uq7DL2Qt8S3jMWCtzhLhv34YZ84&ll=38.01603792916025%2C23.79033106347657&z=10) to the resulted file `Transportation.csv`.

In [3]:
dfTransport = pd.read_csv('/content/drive/MyDrive/Colab_projects/Data_Mining/Project/Data_Prep/Transportation_Metro_Tram.csv')
dfTransport = dfTransport[dfTransport['geometry/type'] != 'LineString']
dfTransport.drop(columns=['type','properties/description', 'geometry/type', 'properties/tessellate'], inplace = True)
dfTransport.rename(columns={'properties/Name': 'Name', 'geometry/coordinates/0' : 'longitude', 'geometry/coordinates/1' : 'latitude' }, inplace=True)
dfTransport.dropna(how='all', axis=1, inplace=True)
dfTransport.reset_index(inplace=True)
dfTransport.drop(columns='index', inplace=True)
dfTransport.head()

Unnamed: 0,Name,longitude,latitude
0,Neos Kosmos (Νέος Κόσμος‎),23.727947,37.957471
1,Faliro (Φάληρο) / S.E.F. (Σ.Ε.Φ.),23.664551,37.944198
2,Piraeus (Πειραιάς),23.639188,37.947616
3,"International Airport ""Eleftherios Venizelos"" ...",23.952599,37.940542
4,Larissa Station,23.720652,37.992341


Used [MyGeodata](https://mygeodata.cloud/) to convert the kml downloaded from [Google Maps](https://www.google.com/maps/d/u/0/viewer?ie=UTF8&t=h&oe=UTF8&msa=0&mid=1oEiURG0UyGJBnMErK3DTtwzsvJo&ll=38.02091428513228%2C23.7600215&z=13) to the resulted file `Attractions.csv`

In [4]:
dfAttractions = pd.read_csv('/content/drive/MyDrive/Colab_projects/Data_Mining/Project/Data_Prep/Attractions.csv')
#dfAttractions.drop(columns=['description', 'gid', 'tessellate'], inplace = True)

dfAttractions.head()

Unnamed: 0,longitude,latitude,Location
0,23.73567,37.975989,Sytagma_Square
1,23.733963,37.97567,Ermou_Street
2,23.729435,37.982571,Stadiou_Avenue
3,23.73053,37.98318,Panepistimiou_Eleftheriou_Venizelou_Avenue
4,23.743306,37.981885,Lycabetttus_Hill


We will use these dataframes to create new features based on the distance of these locations from the apartments

In [5]:
appartmentLocations = dfRaw[['id', 'longitude', 'latitude']].copy()
appartmentLocations

Unnamed: 0,id,longitude,latitude
0,10595,23.765270,37.988630
1,10990,23.764480,37.989030
2,10993,23.764730,37.988880
3,10995,23.764480,37.989030
4,27262,23.765000,37.989240
...,...,...,...
9577,52959003,23.728438,37.976986
9578,52959885,23.731117,37.955988
9579,52959925,23.723520,37.985283
9580,52960132,23.730460,37.987990


Used [MyGeodata](https://mygeodata.cloud/) to convert the kml downloaded from [Google Maps](https://www.google.com/maps/d/u/0/viewer?ie=UTF8&t=h&oe=UTF8&msa=0&mid=1oEiURG0UyGJBnMErK3DTtwzsvJo&ll=38.02091428513228%2C23.7600215&z=13) to the resulted file `Attractions.csv`.

In [6]:
# Distance between two points of erath function
from math import radians, cos, sin, asin, sqrt 
def map_distance(lat1, lat2, lon1, lon2):
  
    # The math module contains a function named
    # radians which converts from degrees to radians.
    lon1 = radians(lon1)
    lon2 = radians(lon2)
    lat1 = radians(lat1)
    lat2 = radians(lat2)
      
    # Haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
 
    c = 2 * asin(sqrt(a))
    
    # Radius of earth in kilometers
    r = 6371
      
    # calculate the result
    return(c * r)

# Min distance from station function
def nearest_transport_station_distance(row, stationsDf):
   global distances
   distances = []
   for j in range(len(stationsDf)):
      rowLat = row['latitude']
      rowLon = row['longitude']
      station = stationsDf.loc[j]
      stationLat = station['latitude']
      stationLon = station['longitude']
      distance = map_distance(rowLat, stationLat, rowLon, stationLon)
      distances.append(distance)
      min_dist = min(distances)
      distances.clear()
   return min_dist

In [7]:
# Get the distance from the nearest transport station in the column 'nearest_station_distance'

for i in appartmentLocations.index:

   appartmentLocations.at[i, 'nearest_station_distance'] = nearest_transport_station_distance(appartmentLocations.loc[i], dfTransport)

In [8]:
# Create one column per Unique location and find it's distance
for i in dfAttractions.Location.index:
   name =  dfAttractions.loc[i].Location
   for row in appartmentLocations.index:
      appartmentLocations.at[row, 'distance_from_' + name] = map_distance(appartmentLocations.loc[row]['latitude'], dfAttractions.loc[i]['latitude'], appartmentLocations.loc[row]['longitude'], dfAttractions.loc[i]['longitude'])

In [9]:
appartmentLocations

Unnamed: 0,id,longitude,latitude,nearest_station_distance,distance_from_Sytagma_Square,distance_from_Ermou_Street,distance_from_Stadiou_Avenue,distance_from_Panepistimiou_Eleftheriou_Venizelou_Avenue,distance_from_Lycabetttus_Hill,distance_from_The_Parliament,distance_from_Platia_Filikis_Eterias,distance_from_Kolonaki,distance_from_National_Library_of_Greece,distance_from_The_new_Acropolis_museum,distance_from_National_Archaeological_Museum,distance_from_Keramikos,distance_from_The_Acropolis,distance_from_Dionysiou_Aeropagitou,distance_from_Plaka,distance_from_Voukourestiou_Street,distance_from_Exarchia,distance_from_Panathenaic_Stadium,distance_from_Dromeas,distance_from_Mitropoleos_Square,distance_from_Vasilissis_Sofias_Avenue,distance_from_Benaki_Museum,distance_from_Calatrava's_pedestrian_bridge,distance_from_Eleftheria_Park,distance_from_Church_of_Kapnikarea,distance_from_Olympic_Athletic_Center_of_Athens,distance_from_Kifisia
0,10595,23.765270,37.988630,8.288646,2.950583,3.099286,3.212032,3.104327,2.065886,2.886630,2.541786,2.258001,2.881140,3.921319,2.854622,4.852770,3.888144,4.272533,3.640993,3.013813,2.670001,3.080872,1.923042,3.509716,2.924393,2.590226,1.095094,1.434913,3.494269,5.623390,10.159793
1,10990,23.764480,37.989030,8.278029,2.911781,3.059500,3.154191,3.045635,2.018615,2.851311,2.507554,2.224787,2.829221,3.890789,2.785070,4.795962,3.849994,4.233582,3.602674,2.972142,2.605198,3.063183,1.905525,3.466668,2.885518,2.557047,1.136339,1.413766,3.448477,5.609524,10.144744
2,10993,23.764730,37.988880,8.283654,2.922588,3.070662,3.171802,3.063551,2.032335,2.860895,2.516729,2.233575,2.844676,3.898769,2.807018,4.813241,3.860622,4.244505,3.613351,2.983961,2.625350,3.066708,1.908848,3.478999,2.896348,2.565860,1.123964,1.418203,3.461777,5.616179,10.151847
3,10995,23.764480,37.989030,8.278029,2.911781,3.059500,3.154191,3.045635,2.018615,2.851311,2.507554,2.224787,2.829221,3.890789,2.785070,4.795962,3.849994,4.233582,3.602674,2.972142,2.605198,3.063183,1.905525,3.466668,2.885518,2.557047,1.136339,1.413766,3.448477,5.609524,10.144744
4,27262,23.765000,37.989240,8.237345,2.962929,3.110677,3.203904,3.095166,2.069701,2.902287,2.558444,2.275585,2.879896,3.941381,2.830752,4.845837,3.901127,4.284743,3.653811,3.023345,2.652987,3.111787,1.953992,3.517868,2.936668,2.607854,1.085159,1.462844,3.499608,5.570072,10.105839
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9577,52959003,23.728438,37.976986,11.084893,0.643504,0.505889,0.627147,0.712735,1.412413,0.782680,1.070358,1.326753,0.642844,0.952041,1.388211,1.505689,0.621375,0.913272,0.470390,0.524896,1.212656,1.325538,1.840482,0.204908,0.662281,1.052970,4.566572,2.171518,0.066632,8.432761,12.815280
9578,52959885,23.731117,37.955988,12.897647,2.259593,2.202765,2.959619,3.024100,3.071509,2.200178,2.435484,2.650041,2.772621,1.400444,3.674550,3.052053,1.782814,1.759713,1.876330,2.317644,3.422917,1.678061,2.803968,2.144485,2.278565,2.374849,5.738352,3.299330,2.280130,10.216862,14.713951
9579,52959925,23.723520,37.985283,10.651897,1.483933,1.407255,0.599749,0.657384,1.774798,1.633147,1.786324,1.946726,1.037863,1.926724,0.905003,1.297653,1.543393,1.672761,1.469480,1.352474,1.009327,2.289012,2.465157,1.220429,1.488882,1.800048,4.703864,2.634277,1.082309,8.051185,12.310227
9580,52960132,23.730460,37.987990,10.023203,1.410413,1.403904,0.609225,0.534883,1.314660,1.535094,1.552398,1.619170,0.852667,2.181615,0.226715,1.971277,1.854241,2.092357,1.697400,1.305751,0.418484,2.237397,2.086085,1.427370,1.403074,1.587266,4.053012,2.147154,1.300370,7.406368,11.706299


In [10]:
dfRaw.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,10595,https://www.airbnb.com/rooms/10595,20211025162728,2021-10-26,"96m2, 3BR, 2BA, Metro, WI-FI etc...",Athens Furnished Apartment No6 is 3-bedroom ap...,Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/f7e19a44-5afe...,37177,https://www.airbnb.com/users/show/37177,Emmanouil,2009-09-08,"Athens, Attica, Greece",Athens Quality Apartments is a company started...,,,,t,https://a0.muscache.com/im/pictures/user/859c1...,https://a0.muscache.com/im/pictures/user/859c1...,Ambelokipi,6.0,6.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,,37.98863,23.76527,Entire rental unit,Entire home/apt,8,,2 baths,3.0,5.0,"[""Kitchen"", ""Free street parking"", ""Crib"", ""Pa...",$79.00,1,1125,2,8,1125,1125,2.3,1125.0,,t,19,49,79,170,2021-10-26,32,7,0,2015-05-25,2019-04-04,4.77,4.81,4.75,4.84,4.84,4.5,4.66,957568,t,6,6,0,0,0.41
1,10990,https://www.airbnb.com/rooms/10990,20211025162728,2021-10-25,Athens Quality Apartments - Deluxe Apartment,Athens Quality Apartments - Deluxe apartment i...,Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/8645179/c1728...,37177,https://www.airbnb.com/users/show/37177,Emmanouil,2009-09-08,"Athens, Attica, Greece",Athens Quality Apartments is a company started...,,,,t,https://a0.muscache.com/im/pictures/user/859c1...,https://a0.muscache.com/im/pictures/user/859c1...,Ambelokipi,6.0,6.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,,37.98903,23.76448,Entire rental unit,Entire home/apt,4,,1 bath,1.0,1.0,"[""Kitchen"", ""Luggage dropoff allowed"", ""Free s...",$50.00,1,1125,1,8,1125,1125,1.5,1125.0,,t,26,56,86,361,2021-10-25,52,12,1,2015-11-25,2016-02-22,4.86,4.94,4.9,4.9,4.92,4.82,4.82,1070920,t,6,6,0,0,0.72
2,10993,https://www.airbnb.com/rooms/10993,20211025162728,2021-10-25,Athens Quality Apartments - Studio,The Studio is an <br />-excellent located <br ...,Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/107309527/848...,37177,https://www.airbnb.com/users/show/37177,Emmanouil,2009-09-08,"Athens, Attica, Greece",Athens Quality Apartments is a company started...,,,,t,https://a0.muscache.com/im/pictures/user/859c1...,https://a0.muscache.com/im/pictures/user/859c1...,Ambelokipi,6.0,6.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,,37.98888,23.76473,Entire rental unit,Entire home/apt,2,,1 bath,,1.0,"[""Kitchen"", ""Free street parking"", ""Patio or b...",$38.00,1,1125,1,8,1125,1125,2.2,1125.0,,t,15,26,56,331,2021-10-25,71,19,3,2015-10-18,2018-03-31,4.85,4.91,4.94,4.97,4.97,4.83,4.83,957080,t,6,6,0,0,0.97
3,10995,https://www.airbnb.com/rooms/10995,20211025162728,2021-10-25,"AQA-No2 1-bedroom, smart tv, fiber connection,","AQA No2 is 1-bedroom apartment (47m2), on the ...",Ampelokipi district is nice multinational and ...,https://a0.muscache.com/pictures/6a565613-aaa3...,37177,https://www.airbnb.com/users/show/37177,Emmanouil,2009-09-08,"Athens, Attica, Greece",Athens Quality Apartments is a company started...,,,,t,https://a0.muscache.com/im/pictures/user/859c1...,https://a0.muscache.com/im/pictures/user/859c1...,Ambelokipi,6.0,6.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"Athens, Attica, Greece",ΑΜΠΕΛΟΚΗΠΟΙ,,37.98903,23.76448,Entire rental unit,Entire home/apt,4,,1 bath,1.0,2.0,"[""Kitchen"", ""Free street parking"", ""Patio or b...",$48.00,1,1125,1,8,1125,1125,1.5,1125.0,,t,22,52,82,357,2021-10-25,24,1,0,2015-12-05,2016-08-06,4.79,4.95,4.91,4.91,4.87,4.77,4.77,957422,t,6,6,0,0,0.33
4,27262,https://www.airbnb.com/rooms/27262,20211025162728,2021-10-26,"54m2, 1-br, cable tv, wi-fi, metro",Big 1-bedroom apartment that can accommodate 4...,,https://a0.muscache.com/pictures/8651803/4b82b...,37177,https://www.airbnb.com/users/show/37177,Emmanouil,2009-09-08,"Athens, Attica, Greece",Athens Quality Apartments is a company started...,,,,t,https://a0.muscache.com/im/pictures/user/859c1...,https://a0.muscache.com/im/pictures/user/859c1...,Ambelokipi,6.0,6.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,,ΑΜΠΕΛΟΚΗΠΟΙ,,37.98924,23.765,Entire rental unit,Entire home/apt,4,,1 bath,1.0,1.0,"[""Kitchen"", ""Free street parking"", ""Crib"", ""Pa...",$47.00,1,1125,1,8,1125,1125,1.8,1125.0,,t,0,27,57,208,2021-10-26,17,0,0,2015-11-12,2017-05-15,4.76,4.81,4.94,4.94,5.0,4.69,4.63,957579,t,6,6,0,0,0.23


In [11]:
from pandas_profiling import ProfileReport
import pandas_profiling
from pandas_profiling.utils.cache import cache_file

In [12]:
profile = ProfileReport(dfRaw, title="Pandas Profiling Report", explorative=True)
# Save the report for future use
#profile.to_file(output_file = '/content/drive/MyDrive/Colab_projects/Data_Mining/Project/Data_Prep/Raw_Data_Profiling_Report')
profile.to_notebook_iframe()

Summarize dataset:   0%|          | 0/5 [00:00<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
# Sentiment and readability analysis on description and Name
# ! pip install SpacyTextBlob
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

In [None]:
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')
text = 'I had a really horrible day. It was the worst day ever! But every now and then I have a really good day that makes me happy.'
doc = nlp(text
doc._.polarity
doc._.subjectivity
doc._.assessments