# Content Library
* Importing Libraries and Data
* Data Wrangling
* Data Cleaning
* Creating Choropleth Map

# 1 Import Libraries and Data

In [1]:
#libraries
import pandas as pd
import numpy as np
import seaborn as sns
import os
import matplotlib
import folium
import json

In [2]:
#dataset
df=pd.read_csv(r'C:\Users\Samth\CareerFoundry Projects\Achievement 6\1 Data\Manipulated Data\userapi_clean.csv')

In [3]:
#import json file for geographical component
geomap=r'C:\Users\Samth\CareerFoundry Projects\Achievement 6\1 Data\Manipulated Data\custom.geo.json'

In [4]:
#prompt for visuals to view in notebook
%matplotlib inline

# 2 Data Wrangling

In [5]:
#viewing dataset columns
df.columns

Index(['Unnamed: 0', 'gender', 'genderLooking', 'age', 'counts_details',
       'counts_pictures', 'counts_profileVisits', 'counts_kisses',
       'counts_fans', 'counts_g', 'flirtInterests_chat',
       'flirtInterests_friends', 'flirtInterests_date', 'country', 'city',
       'location', 'distance', 'isFlirtstar', 'isHighlighted', 'isInfluencer',
       'isMobile', 'isNew', 'isOnline', 'isVip', 'lang_count', 'lang_fr',
       'lang_en', 'lang_de', 'lang_it', 'lang_es', 'lang_pt', 'verified',
       'shareProfileEnabled', 'lastOnlineDate', 'lastOnlineTime', 'userId'],
      dtype='object')

In [6]:
#creating subset with columns needed only for analysis/map 
df2=df.drop(columns=['Unnamed: 0','counts_details','counts_pictures','counts_g','flirtInterests_chat','flirtInterests_friends','flirtInterests_date','distance','isFlirtstar','isHighlighted','isInfluencer','isMobile','isNew','isOnline','isVip','lang_count','lang_fr','lang_en','lang_de','lang_it','lang_es','lang_pt','verified','shareProfileEnabled'])

In [7]:
#viewing new subset with needed columns
df2.head()

Unnamed: 0,gender,genderLooking,age,counts_profileVisits,counts_kisses,counts_fans,country,city,location,lastOnlineDate,lastOnlineTime,userId
0,F,M,25,8279,239,0,CH,Rothenburg,Rümlang,2015-04-25T20:43:26Z,1429995000.0,55303fc3160ba0eb728b4575
1,F,M,22,663,13,0,CH,Sissach,Sissach,2015-04-26T09:19:35Z,1430040000.0,552e7b61c66da10d1e8b4c82
2,F,M,21,1369,88,0,CH,Bâle,Bâle,2015-04-06T14:24:07Z,1428330000.0,54a584ecc56da128638b4674
3,F,none,20,22187,1015,2,CA,Montréal,Berne,2015-04-07T11:21:01Z,1428406000.0,54c92738076ea1b5338b4735
4,F,M,21,35262,1413,9,DE,Rastatt,Rastatt,2015-04-06T14:25:20Z,1428330000.0,54e1a6f6c76da135748b4a3a


In [7]:
#creating flag for user age groups
df2.loc[df2['age'] >= 25, 'Age Category']= 'Late Twenties'

In [8]:
df2.loc[df2['age'] < 25, 'Age Category']= 'Early Twenties'

In [9]:
#viewing count and data type of new category
df2['Age Category'].value_counts(dropna=False)

Age Category
Early Twenties    3146
Late Twenties      426
Name: count, dtype: int64

#### This tells us that about 88% of the women in the dataset (3,146 of 3,572) are in their early twenties, under the age of 25. Using the original data together with this new flag, the goal is to create a map that shows either how the age groups are spread across countries or how popular users are in different countries. Possibly both!
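
The 88% figure can be read straight from the counts above. The sketch below uses a small made-up `ages` dataframe (not the real dataset) to show the same flag built in one step with `np.where`, and `value_counts(normalize=True)` to get shares instead of raw counts. The last line redoes the arithmetic with the counts reported above.

```python
import numpy as np
import pandas as pd

# Toy ages for illustration only; the real column is df2['age']
ages = pd.DataFrame({"age": [25, 22, 21, 20, 21, 27, 19, 24]})

# Same rule as the two .loc assignments: 25 and over vs under 25
ages["Age Category"] = np.where(
    ages["age"] >= 25, "Late Twenties", "Early Twenties"
)

# normalize=True returns proportions rather than counts
share = ages["Age Category"].value_counts(normalize=True)

# Share of "Early Twenties" using the counts from the notebook output
early_share = 3146 / (3146 + 426)
print(share)
print(f"{early_share:.1%}")
```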

In [10]:
#viewing subset with new flags
df2.head()

Unnamed: 0,gender,genderLooking,age,counts_profileVisits,counts_kisses,counts_fans,country,city,location,lastOnlineDate,lastOnlineTime,userId,Age Category
0,F,M,25,8279,239,0,CH,Rothenburg,Rümlang,2015-04-25T20:43:26Z,1429995000.0,55303fc3160ba0eb728b4575,Late Twenties
1,F,M,22,663,13,0,CH,Sissach,Sissach,2015-04-26T09:19:35Z,1430040000.0,552e7b61c66da10d1e8b4c82,Early Twenties
2,F,M,21,1369,88,0,CH,Bâle,Bâle,2015-04-06T14:24:07Z,1428330000.0,54a584ecc56da128638b4674,Early Twenties
3,F,none,20,22187,1015,2,CA,Montréal,Berne,2015-04-07T11:21:01Z,1428406000.0,54c92738076ea1b5338b4735,Early Twenties
4,F,M,21,35262,1413,9,DE,Rastatt,Rastatt,2015-04-06T14:25:20Z,1428330000.0,54e1a6f6c76da135748b4a3a,Early Twenties


# 3 Data Cleaning

In [10]:
#checking for any missing values in the dataset
df2.isnull().sum()

gender                  0
genderLooking           0
age                     0
counts_profileVisits    0
counts_kisses           0
counts_fans             0
country                 0
city                    0
location                0
lastOnlineDate          0
lastOnlineTime          0
userId                  0
Age Category            0
dtype: int64

In [11]:
#checking for duplicates
dups=df2.duplicated()

In [12]:
dups.shape #returns the length of the mask (one entry per row); dups.sum() gives the actual duplicate count

(3572,)
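
A note on reading this output: `duplicated()` returns one boolean per row, so its shape always equals the number of rows, whether or not any duplicates exist. A minimal sketch with a made-up three-row frame (the `toy` name and its values are illustrative) shows how the count of duplicates can be checked with `sum()` instead.

```python
import pandas as pd

# Toy frame with one exact duplicate row (rows 1 and 2 are identical)
toy = pd.DataFrame({"userId": ["a", "b", "b"], "age": [20, 21, 21]})

dups = toy.duplicated()

n_rows = dups.shape[0]    # length of the mask, equal to the row count
n_dups = int(dups.sum())  # number of rows flagged as duplicates

# Dropping the flagged rows, if any were found
deduped = toy.drop_duplicates()
print(n_rows, n_dups, len(deduped))
```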

In [14]:
#checking for extreme values in kisses/likes column
df2[df2['counts_kisses'] >2000]

Unnamed: 0,gender,genderLooking,age,counts_profileVisits,counts_kisses,counts_fans,country,city,location,lastOnlineDate,lastOnlineTime,userId,Age Category
8,F,M,20,29984,2389,10,CH,Gstaad,Steffisburg,2015-04-07T20:01:55Z,1428437000.0,550c8310066ea13f808b4b35,Early Twenties
12,F,M,22,31736,2102,6,DE,Neu-Ulm,Holzheim,2015-04-06T14:52:17Z,1428332000.0,54fa217d190ba0a1618b4668,Early Twenties
30,F,M,24,51339,2926,1,CH,Erlinsbach,Erlinsbach (AG),2015-04-08T10:38:36Z,1428490000.0,552025c01c0ba01e1b8b4588,Early Twenties
60,F,M,19,48980,2358,28,CH,Würenlingen,Würenlingen,2015-04-26T11:04:52Z,1430046000.0,5525ad66140ba01c5b8b496d,Early Twenties
62,F,M,20,33528,3679,32,CH,Basel,Basel,2015-04-08T15:19:24Z,1428506000.0,55038f65c96da15c998b4a4c,Early Twenties
76,F,M,20,51560,6155,18,CH,Siglistorf,Steinmaur,2015-04-26T11:47:45Z,1430049000.0,55380277c66da1264a8b47ef,Early Twenties
90,F,M,21,52465,3739,4,CH,Fribourg,Schwarzsee,2015-04-26T10:45:05Z,1430045000.0,553a3ff7086ea1d37d8b50bc,Early Twenties
97,F,M,24,43796,2750,2,CH,"Emmenbrücke (Lucerna, Schweiz)",Emmen,2015-04-26T06:07:35Z,1430028000.0,54c34090190ba0a5428b4ec5,Early Twenties
112,F,none,19,29903,2146,0,CH,Zürich,Zürich,2015-04-26T09:14:13Z,1430040000.0,55076dc41b0ba0d9208b478a,Early Twenties
113,F,M,23,50275,3687,0,DE,Bergisch Gladbach,Bergisch Gladbach,2015-04-06T14:52:29Z,1428332000.0,55205176160ba031038b4bbe,Early Twenties


#### Although there are some extreme values in the dataset, I will only cap values above 5,000 kisses, as they are not representative of a typical user's kiss count.

In [13]:
#creating new check for extreme values higher than 5000
df2[df2['counts_kisses']>5000]

Unnamed: 0,gender,genderLooking,age,counts_profileVisits,counts_kisses,counts_fans,country,city,location,lastOnlineDate,lastOnlineTime,userId,Age Category
76,F,M,20,51560,6155,18,CH,Siglistorf,Steinmaur,2015-04-26T11:47:45Z,1430049000.0,55380277c66da1264a8b47ef,Early Twenties
155,F,M,23,164425,9288,0,DE,Köln,Köln,2015-04-19T11:42:53Z,1429444000.0,55229ec5ea6da1d7038b46a3,Early Twenties


In [14]:
#capping values above 5000 at 5000
df2['counts_kisses'] = np.where(df2['counts_kisses']>5000, 5000, df2['counts_kisses'])
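
As an alternative sketch, the same capping can be written with `Series.clip`, which avoids referring to the column twice. The `kisses` series below is a toy example that reuses a few values seen in the outputs above.

```python
import pandas as pd

# Toy values for illustration; two of them exceed the 5000 threshold
kisses = pd.Series([239, 6155, 9288, 1413], name="counts_kisses")

# clip(upper=...) replaces anything above the threshold with the threshold
capped = kisses.clip(upper=5000)
print(capped.tolist())
```

On the real data this would read `df2['counts_kisses'] = df2['counts_kisses'].clip(upper=5000)`.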

In [18]:
#viewing the first rows after capping
df2.head()

Unnamed: 0,gender,genderLooking,age,counts_profileVisits,counts_kisses,counts_fans,country,city,location,lastOnlineDate,lastOnlineTime,userId,Age Category
0,F,M,25,8279,239,0,CH,Rothenburg,Rümlang,2015-04-25T20:43:26Z,1429995000.0,55303fc3160ba0eb728b4575,Late Twenties
1,F,M,22,663,13,0,CH,Sissach,Sissach,2015-04-26T09:19:35Z,1430040000.0,552e7b61c66da10d1e8b4c82,Early Twenties
2,F,M,21,1369,88,0,CH,Bâle,Bâle,2015-04-06T14:24:07Z,1428330000.0,54a584ecc56da128638b4674,Early Twenties
3,F,none,20,22187,1015,2,CA,Montréal,Berne,2015-04-07T11:21:01Z,1428406000.0,54c92738076ea1b5338b4735,Early Twenties
4,F,M,21,35262,1413,9,DE,Rastatt,Rastatt,2015-04-06T14:25:20Z,1428330000.0,54e1a6f6c76da135748b4a3a,Early Twenties


In [15]:
#verifying extreme values were replaced
df2[df2['counts_kisses']>5000] #all extreme values have been replaced

Unnamed: 0,gender,genderLooking,age,counts_profileVisits,counts_kisses,counts_fans,country,city,location,lastOnlineDate,lastOnlineTime,userId,Age Category


In [18]:
#exporting extra manipulated dataset for future records and possible use
df2.to_csv(r'C:\Users\Samth\CareerFoundry Projects\Achievement 6\1 Data\Manipulated Data\userapi_manipulated_geo_python.csv')

# 4 Plotting a Choropleth

In [16]:
#creating a dataframe for first map
ageplot= df2[['age','city','location','country']]

In [17]:
#viewing dataframe
ageplot.head()

Unnamed: 0,age,city,location,country
0,25,Rothenburg,Rümlang,CH
1,22,Sissach,Sissach,CH
2,21,Bâle,Bâle,CH
3,20,Montréal,Berne,CA
4,21,Rastatt,Rastatt,DE


In [18]:
#creating map with folium
map = folium.Map(location = [0,0], zoom_start = 0)

folium.Choropleth(
    geo_data = geomap, 
    data = ageplot,
    columns = ['age', 'country'],
    key_on = 'feature.properties.name',
    fill_color = 'RdPu', fill_opacity=0.5, line_opacity=0.2,
    legend_name = "age").add_to(map)

folium.LayerControl().add_to(map)

map

TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [19]:
#because of the error, checking column dtypes for object (string) columns
ageplot.dtypes

age          int64
city        object
location    object
country     object
dtype: object
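
A likely cause of the `TypeError` is the order of the `columns` argument: `folium.Choropleth` expects the key column first and the numeric value column second, so `['age', 'country']` passes the country strings as the values to colour by. Two further things would probably also need attention. The data has many rows per country, while a choropleth needs one value per key, and `country` holds two-letter codes such as `CH`, which would not match a `name` property in the GeoJSON. The sketch below is one possible fix under those assumptions, not a tested reproduction of this notebook. The toy `ageplot` values, the `build_age_map` helper and the `iso_a2` property name are all assumptions, and the property names in `custom.geo.json` should be checked before use.

```python
import pandas as pd

# Toy stand-in for the ageplot dataframe (values are illustrative only)
ageplot = pd.DataFrame({
    "age": [25, 22, 21, 20, 21],
    "country": ["CH", "CH", "CH", "CA", "DE"],
})

# One row per country: the choropleth needs a single numeric value per key
avg_age = ageplot.groupby("country", as_index=False)["age"].mean()


def build_age_map(geo_data, data, key_on="feature.properties.iso_a2"):
    """Return a folium map shaded by mean age per country.

    key_on is an assumption: it must name a GeoJSON property that holds
    the same two-letter codes as data['country'].
    """
    import folium  # imported here so the data prep above runs without folium

    m = folium.Map(location=[0, 0], zoom_start=2)
    folium.Choropleth(
        geo_data=geo_data,
        data=data,
        columns=["country", "age"],  # key column first, numeric value second
        key_on=key_on,
        fill_color="RdPu",
        fill_opacity=0.5,
        line_opacity=0.2,
        legend_name="Mean age",
    ).add_to(m)
    return m
```

With the real data, a call along the lines of `build_age_map(geomap, avg_age)` would return the map object, which displays when it is the last expression in a notebook cell.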

# 5 Commentary

* As you know, my choropleth map ended up being made in Tableau because I had issues building it in Python. Below I address the research questions the map answered and the new questions it raised.

### Although this was not one of the original research questions, the map showed that most of our users are located in Germany, France and Italy, in that order. This matters for marketing and sales because it shows which regions may need more marketing to improve the app's visibility. It also gives further context to one of our research questions: "Do women under 25 or over 25 receive more likes?" Since most users are in these three countries, it would be best to answer this question using data from these countries only, as their users interact on the app more regularly than users in a country with fewer than 100 profiles.
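
The restriction described above could be sketched as follows: keep only countries with enough profiles, then compare mean kisses between the two age categories. The `toy` frame and the `min_profiles` value are made up for illustration. On the real data the threshold would be 100 and the frame would be `df2`.

```python
import pandas as pd

# Toy data for illustration only; column names match df2
toy = pd.DataFrame({
    "country": ["DE", "DE", "DE", "FR", "FR", "CA"],
    "Age Category": ["Early Twenties", "Late Twenties", "Early Twenties",
                     "Early Twenties", "Late Twenties", "Early Twenties"],
    "counts_kisses": [100, 50, 200, 80, 40, 999],
})

min_profiles = 2  # hypothetical threshold for the toy data; 100 on real data

# Countries that meet the minimum number of profiles
counts = toy["country"].value_counts()
keep = counts[counts >= min_profiles].index

# Compare average kisses per age category within those countries only
subset = toy[toy["country"].isin(keep)]
mean_kisses = subset.groupby("Age Category")["counts_kisses"].mean()
print(mean_kisses)
```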