In [1]:
from IPython.core.display import HTML
HTML("""
<style>

div.cell { /* Tunes the space between cells */
margin-top:0.5em;
margin-bottom:0.5em;
}

div.text_cell_render h1 { /* Main titles bigger, centered */
font-size: 2.2em;
line-height:1.4em;
text-align:center;
}


div.text_cell_render { /* Customize text cells */
font-family: 'IBM Plex Sans';
font-size:1.2em;
line-height:1.4em;
padding-left:2em;
padding-right:3em;
}
</style>
""")

## Identifying spots for Family Welfare Hoardings <br /> Real Life Data Science Problem

Web-scrapping, Foursquare API, Folium Map and more

One of the biggest issues that India is faced with is population explosion. The resources that are generated within the country, whether it be food products, industrial products or otherwise are not enough to cater to the needs of the ever-growing population, leading to scarcity of resources, spiralling prices of food products etc.. While the agricultural production is growing in arithmetic progression, Indian population is growing in geometric progression!! Due to the ever-rising population, the Government is also not able to spend enough on welfare measures like health, education etc., to be able to cover the entire population or even to cover the needy / people below poverty line as their numbers also increase multifold with increase in population. The myth of the boy child has also contributed to increase in population to a great extent.

### 1.	Discussion and Background of the Business Problem:

Problem Statement: The Indian Government has been taking lot of steps to control the population and spending a lot of money on population control measures. Recently the Health and Family Welfare department of the Government of India has introduced a new measure to reward below poverty line / middle class families with Single Daughters by providing them with free education upto Graduation and scholarships for studies beyond graduation. To be able to get the scheme publicised, the department is planning to display eye catching hoardings in thickly populated places / places with high foot traffic, listing the various points of the schemes and how the eligible families would benefit from the same. <br /> Since the funding available for publicising the scheme is limited, the objective of this project would be to identify the strategic points for the display which would have the highest cost-benefit effect. <br /> So the department has decided that the funding would be utilised to put up 25 hoardings in 50 cities which had the highest increase in population during the previous 2 census periods viz., 2011 vs 2010.

#### Target Audience:

1.The Health and Family Welfare Department of the Government of India. <br /> 2.The publicity companies who would win the tender to put up the hoardings. <br  /> 3.Any corporates who would offer to put up those hoardings as part of their CSR activity.<br /> 4. Budding Data Scientists, who want to implement some of the most used Exploratory Data Analysis techniques to obtain necessary data, analyze it, and finally be able to tell a story out of it.

### 2. Data Preparation:

#### 2.1 Scrapping 

I first made use of the Census data for 2001 and 2011 for the major India cities available in the link https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population. I have scrapped the table to create a data frame. For this, I have used requests and Beautifulsoup4 library to create a data-frame containing the name of the 300 cities with their Rank, Name, Population in 2001 and 2011 census and the name of the State of Union Territory to which the city belongs.

In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population'
website_text = requests.get(url).text
soup = BeautifulSoup(website_text, 'xml')
table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')
data = []
for row in table_rows:
    data.append([t.text.strip() for t in row.find_all('td')])

df_raw = pd.DataFrame(data, columns=['Rank','City','Population(2011)','Population(2001)','State or union territory'])
df_raw.head()

Unnamed: 0,Rank,City,Population(2011),Population(2001),State or union territory
0,,,,,
1,1.0,Mumbai,12442373.0,11978450.0,Maharashtra
2,2.0,Delhi,11007835.0,9879172.0,Delhi
3,3.0,Bangalore,8436675.0,6537124.0,Karnataka
4,4.0,Hyderabad,6809970.0,3637483.0,Telangana


##### After some manipulation, to exclude invalid rows, include calculation for increase in population in 2011 compared to 2001 in absolute numbers and % etc., the table looks as below:


In [3]:
df_raw = df_raw[~df_raw['Rank'].isnull()]
df_raw = df_raw.drop('Rank',axis=1)
df_filtered = df_raw.drop(df_raw[df_raw['Population(2001)']=='―'].index)
df_new = df_filtered.replace(',','',regex=True)
df_new['Population(2011)'] = df_new['Population(2011)'].astype(float)
df_new['Population(2001)'] = df_new['Population(2001)'].astype(float)
df_new['Increase'] = df_new['Population(2011)']-df_new['Population(2001)']
df_new['Increase%']=df_new['Increase']/df_new['Population(2001)']
neworder = ['City','State or union territory','Population(2011)','Population(2001)','Increase','Increase%']
df_new = df_new.reindex(columns=neworder)
df_new = df_new.sort_values('Increase%',ascending=False)
df_top50 = df_new.nlargest(50,['Increase','Increase%'])
df_top50 = df_top50.reset_index(drop=True)
df_top50['City'] = df_top50['City'].str.split('[').str[0]
df_top50['City'] = df_top50['City'].str.split('-').str[0]
df_top50['Address'] = df_top50['City']+', India'
df_top50

Unnamed: 0,City,State or union territory,Population(2011),Population(2001),Increase,Increase%,Address
0,Hyderabad,Telangana,6809970.0,3637483.0,3172487.0,0.872165,"Hyderabad, India"
1,Ahmedabad,Gujarat,5570585.0,3520085.0,2050500.0,0.582514,"Ahmedabad, India"
2,Surat,Gujarat,4467797.0,2433835.0,2033962.0,0.835703,"Surat, India"
3,Bangalore,Karnataka,8436675.0,6537124.0,1899551.0,0.290579,"Bangalore, India"
4,Delhi,Delhi,11007835.0,9879172.0,1128663.0,0.114247,"Delhi, India"
5,Visakhapatnam,Andhra Pradesh,1728128.0,982904.0,745224.0,0.758186,"Visakhapatnam, India"
6,Jaipur,Rajasthan,3046163.0,2322575.0,723588.0,0.311546,"Jaipur, India"
7,Pimpri,Maharashtra,1727692.0,1012472.0,715220.0,0.70641,"Pimpri, India"
8,Gurgaon,Haryana,876824.0,173542.0,703282.0,4.052518,"Gurgaon, India"
9,Ghaziabad,Uttar Pradesh,1648643.0,968256.0,680387.0,0.702693,"Ghaziabad, India"


#### 2.2 Getting Coordinates of Major Cities : Geopy Client

Next objective is to get the coordinates of these 50 major cities using <br />geocoder class of Geopy client

In [4]:
from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent='India explorer')
df_top50['Latitude'] = df_top50['Address'].apply(geolocator.geocode).apply(lambda x: x.latitude)
df_top50['Longitude'] = df_top50['Address'].apply(geolocator.geocode).apply(lambda x: x.longitude)
df_top50

Unnamed: 0,City,State or union territory,Population(2011),Population(2001),Increase,Increase%,Address,Latitude,Longitude
0,Hyderabad,Telangana,6809970.0,3637483.0,3172487.0,0.872165,"Hyderabad, India",17.388786,78.461065
1,Ahmedabad,Gujarat,5570585.0,3520085.0,2050500.0,0.582514,"Ahmedabad, India",23.021624,72.579707
2,Surat,Gujarat,4467797.0,2433835.0,2033962.0,0.835703,"Surat, India",21.186461,72.808128
3,Bangalore,Karnataka,8436675.0,6537124.0,1899551.0,0.290579,"Bangalore, India",12.97912,77.5913
4,Delhi,Delhi,11007835.0,9879172.0,1128663.0,0.114247,"Delhi, India",28.651718,77.221939
5,Visakhapatnam,Andhra Pradesh,1728128.0,982904.0,745224.0,0.758186,"Visakhapatnam, India",17.723128,83.301284
6,Jaipur,Rajasthan,3046163.0,2322575.0,723588.0,0.311546,"Jaipur, India",26.916194,75.820349
7,Pimpri,Maharashtra,1727692.0,1012472.0,715220.0,0.70641,"Pimpri, India",20.797091,76.32622
8,Gurgaon,Haryana,876824.0,173542.0,703282.0,4.052518,"Gurgaon, India",28.464615,77.029919
9,Ghaziabad,Uttar Pradesh,1648643.0,968256.0,680387.0,0.702693,"Ghaziabad, India",28.711241,77.444537


### 3. Visualisation and Data Exploration

In [5]:
address = 'India'
geolocator = Nominatim(user_agent="India_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of India are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of India are 22.3511148, 78.6677428.


In [6]:
import matplotlib.cm as cm
import matplotlib.colors as colors

# import k-means from clustering stage
from sklearn.cluster import KMeans

!conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library

Solving environment: done

# All requested packages already installed.



In [7]:
# create map of New York using latitude and longitude values
map_india = folium.Map(location=[latitude, longitude], zoom_start=5.25)

# add markers to map
for lat, lng, city, state in zip(df_top50['Latitude'], df_top50['Longitude'], df_top50['City'], df_top50['State or union territory']):
    label = '{}, {}'.format(city, state)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_india)  
    
map_india

### 4. Plot Venues using Foursquare for each city

In [8]:
CLIENT_ID = 'IKQA4XDS5FMGR3ADWAEY055AU5OX1XF1DHOLJQXF1GB5KGPR' # your Foursquare ID
CLIENT_SECRET = 'G2IU3NDY4QFW2OE2VLHAMXOCIPNPVJKPT2U31QAVCNCKMUWF' # your Foursquare Secret
VERSION = '20190831' # Foursquare API version

print('Your credentails:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentails:
CLIENT_ID: IKQA4XDS5FMGR3ADWAEY055AU5OX1XF1DHOLJQXF1GB5KGPR
CLIENT_SECRET:G2IU3NDY4QFW2OE2VLHAMXOCIPNPVJKPT2U31QAVCNCKMUWF


In [9]:
LIMIT = 100
def getNearbyVenues(names, latitudes, longitudes, radius=2000):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['City', 
                  'Latitude', 
                  'Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)

In [10]:
majcities_venues = getNearbyVenues(names=df_top50['City'],
                                   latitudes=df_top50['Latitude'],
                                   longitudes=df_top50['Longitude']
                                  )


Hyderabad
Ahmedabad
Surat
Bangalore
Delhi
Visakhapatnam
Jaipur
Pimpri
Gurgaon
Ghaziabad
Lucknow
Pune
Thane
Tiruppur
Vasai
Indore
Mumbai
Navi Mumbai
Nashik
Raipur
Loni
Vadodara
Bhopal
Faridabad
Nagpur
Chennai
Noida
Jalgaon
Firozabad
Erode
Rajkot
Patna
Agra
Kota
Aurangabad
Mira
Srinagar
Warangal
Meerut
Saharanpur
Gwalior
Guntur
Ranchi
Nellore
Ludhiana
Kadapa
Kanpur
Aligarh
Bhubaneswar
Kolkata


In [18]:
print(majcities_venues.shape)
majcities_venues.head()

(1332, 9)


Unnamed: 0,City,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Count,CumCount
0,Hyderabad,17.388786,78.461065,Subhan Bakery,17.392412,78.464712,Bakery,1,1
1,Hyderabad,17.388786,78.461065,Cafe Niloufer & Bakers,17.399715,78.462881,Café,1,2
2,Hyderabad,17.388786,78.461065,Laxman Ki Bandi,17.378895,78.463973,South Indian Restaurant,1,3
3,Hyderabad,17.388786,78.461065,Famous Ice Cream,17.384321,78.474796,Ice Cream Shop,1,4
4,Hyderabad,17.388786,78.461065,Karachi Bakery,17.383454,78.475075,Bakery,1,5


In [12]:
majcities_venues.groupby('City').count()

Unnamed: 0_level_0,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
City,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Agra,10,10,10,10,10,10
Ahmedabad,46,46,46,46,46,46
Aligarh,4,4,4,4,4,4
Bangalore,100,100,100,100,100,100
Bhopal,12,12,12,12,12,12
Bhubaneswar,25,25,25,25,25,25
Chennai,28,28,28,28,28,28
Delhi,82,82,82,82,82,82
Erode,6,6,6,6,6,6
Faridabad,26,26,26,26,26,26


In [22]:
majcities_venues['Count'] = 1
majcities_venues['CumCount']=majcities_venues['Count'].groupby(majcities_venues['City']).cumsum()
majcities_venues_25 = majcities_venues[majcities_venues['CumCount']<=25]
majcities_venues_25.shape

(707, 9)

In [23]:
majcities_venues_25.tail()

Unnamed: 0,City,Latitude,Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category,Count,CumCount
1262,Kolkata,22.567746,88.347602,Calcutta Swimming Club,22.568067,88.34207,Pool,1,21
1263,Kolkata,22.567746,88.347602,Tantra,22.553843,88.351459,Nightclub,1,22
1264,Kolkata,22.567746,88.347602,Raj's Spanish Cafe,22.558344,88.354178,Café,1,23
1265,Kolkata,22.567746,88.347602,Bar-B-Q,22.553125,88.352625,BBQ Joint,1,24
1266,Kolkata,22.567746,88.347602,Blue Sky Cafe,22.558188,88.352932,Restaurant,1,25


#### There are 707 venues in total for the 50 cities as of 6.45 PM on 9th Sep, 2019, with a maximum of 25 venues per city. This would change based on the time at which the query is run.

### 5. Map venues pertaining to the Hyderbad City on map as an example as plotting for all cities on one map is not viewable

In [14]:
address = 'Hyderabad, India'
geolocator = Nominatim(user_agent="India_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geograpical coordinate of India are {}, {}.'.format(latitude, longitude))

The geograpical coordinate of India are 17.38878595, 78.4610647345315.


In [15]:
Hyderabad_venues_25 = majcities_venues_25[majcities_venues_25['City'] == 'Hyderabad'].reset_index(drop=True)

In [24]:
# create map of Hyderabad using latitude and longitude values
map_Hyderabad = folium.Map(location=[latitude, longitude], zoom_start=12)

for lat1, lng1, venue, city in zip(Hyderabad_venues_25['Venue Latitude'], Hyderabad_venues_25['Venue Longitude'], Hyderabad_venues_25['Venue'], Hyderabad_venues_25['City']):
    label = '{}, {}'.format(venue, city)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat1, lng1],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='blue',
        fill_opacity=0.7,
        parse_html=False).add_to(map_Hyderabad)  


map_Hyderabad

### 6. Results and Discussion:

The Department of Health and Family Welfare of the Government of India just had an idea to use hoardings to publicise its new welfare schemed aimed at controlling population, especially targeting the boy child myth issue, and had requested for an Data Science based solution to identify the best spots for placing such hoardings so they catch the eyes of the maximum population around the area. I have used data from web resources like Wikipedia, python libraries like Geopy, and Foursquare API, to get to the list of 25 most frequented sites in the Top 50 cities. When run at 6.45pm on a Monday, 9th Sep, 2019, I found 707 such venues (with a maximum of 25 venues per city, with some cities having less than 25 venues).. I would try to run the same query at a different time to see how the number varies, as the venues extracted with Foursquare explore option differs with time.

### 7. Conclusion

Finally to conclude this project, We have got a small glimpse of how real life data-science projects look like. I’ve made use of some frequently used python libraries to scrap web-data, use Foursquare API to explore the major cities of India with highest population increase and saw the results using Folium leaflet map. Potential for this kind of analysis in a real life business problem is discussed in great detail. Since the venues identified in this exploration are the ones highly frequented by people, setting up hoardings to publicise the new scheme of the Government for family welfare in turn aimed at controlling population are expected to be highly effective. Hopefully, this kind of analysis will provide you initial guidance to take more real-life challenges using data-science.