# Which neighbourhoods are more suitable for opening a new coffee shop?

---

## About this Notebook  
In this notebook I visualise the neighbourhoods of Los Angeles, California, and their venues. To describe the data more vividly I use several types of plots, such as interactive maps built with `folium.Map`.

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

1. [Downloading and Prepping Data](#0)<br>
2. [Distribution of the accidents across US states (except Alaska)](#2)<br>
3. [Impact of Visibility](#4) <br>
4. [Monthly Distribution](#6) <br>
5. [Hourly Distribution During the Week](#8) <br>
6. [Distribution Along Weekdays](#10) <br>
7. [Analysis on the Recorded Accident Descriptions](#12) <br> 
8. [Impact of the Temperature](#14) <br> 
9. [Quick Check the Impacts of other features](#16) <br> 
10. [Acknowledgements and References](#18) <br> 
</div>
<hr>

# Downloading and Prepping Data <a id="0"></a>

In [1]:
# import libraries which are necessary in this notebook
import numpy as np
import pandas as pd
from os import path
import datetime
import matplotlib
# import folium library
#!pip install folium
from folium import plugins
import folium
import requests # library to handle requests
from pandas import json_normalize # transform JSON file into a pandas dataframe (top-level import, pandas >= 1.0)
# use Waffle from pywaffle library for waffle plot
#!pip install pywaffle
from pywaffle import Waffle
# Start with loading all necessary libraries
#!pip install Pillow
#!pip install wordcloud
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from matplotlib import cm # color map
print('\nLibraries are imported successfully!')


Libraries are imported successfully!


## Now use the Foursquare API to fetch venues for the neighbourhoods of Los Angeles, where the most accidents took place.

In [2]:
# https://usc.data.socrata.com/dataset/Los-Angeles-Neighborhood-Map/r8qd-yxsr
neigh_path = r'C:/pythonwork/kaggle/data/us_accidents/la_neighborhoods.csv'
neigh_df=pd.read_csv(neigh_path)
neigh_df.head()

Unnamed: 0,set,slug,the_geom,kind,external_i,name,display_na,sqmi,type,name_1,slug_1,latitude,longitude,location
0,L.A. County Neighborhoods (Current),acton,MULTIPOLYGON (((-118.20261747920541 34.5389897...,L.A. County Neighborhood (Current),acton,Acton,Acton L.A. County Neighborhood (Current),39.339109,unincorporated-area,,,-118.16981,34.497355,POINT(34.497355239240846 -118.16981019229348)
1,L.A. County Neighborhoods (Current),adams-normandie,MULTIPOLYGON (((-118.30900800000012 34.0374109...,L.A. County Neighborhood (Current),adams-normandie,Adams-Normandie,Adams-Normandie L.A. County Neighborhood (Curr...,0.80535,segment-of-a-city,,,-118.300208,34.031461,POINT(34.031461499124156 -118.30020800000011)
2,L.A. County Neighborhoods (Current),agoura-hills,MULTIPOLYGON (((-118.76192500000009 34.1682029...,L.A. County Neighborhood (Current),agoura-hills,Agoura Hills,Agoura Hills L.A. County Neighborhood (Current),8.14676,standalone-city,,,-118.759885,34.146736,POINT(34.146736499122795 -118.75988450000015)
3,L.A. County Neighborhoods (Current),agua-dulce,MULTIPOLYGON (((-118.2546773959221 34.55830403...,L.A. County Neighborhood (Current),agua-dulce,Agua Dulce,Agua Dulce L.A. County Neighborhood (Current),31.462632,unincorporated-area,,,-118.317104,34.504927,POINT(34.504926999796837 -118.3171036690717)
4,L.A. County Neighborhoods (Current),alhambra,MULTIPOLYGON (((-118.12174700000014 34.1050399...,L.A. County Neighborhood (Current),alhambra,Alhambra,Alhambra L.A. County Neighborhood (Current),7.623814,standalone-city,,,-118.136512,34.085539,POINT(34.085538999123571 -118.13651200000021)


How many neighbourhoods are there in Los Angeles? Let's check.

In [3]:
neigh_df.shape

(272, 14)

In [4]:
print((100*neigh_df.isnull().sum()/neigh_df.shape[0]).round(2))

set             0.0
slug            0.0
the_geom        0.0
kind            0.0
external_i      0.0
name            0.0
display_na      0.0
sqmi            0.0
type            0.0
name_1        100.0
slug_1        100.0
latitude        0.0
longitude       0.0
location        0.0
dtype: float64


There are **272** neighbourhoods. The `name_1` and `slug_1` columns are entirely empty, so they can be dropped.

In [5]:
neigh_df.drop(['name_1', 'slug_1'], axis = 1, inplace=True)

In [6]:
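# NOTE: in this CSV the 'latitude' column actually holds longitudes and the
# 'longitude' column holds latitudes (see the head() output above), so
# LA_lat contains longitudes and LA_lng contains latitudes.
LA_neighbourhood = neigh_df['name'].values
LA_lat = neigh_df['latitude'].values
LA_lng = neigh_df['longitude'].values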

In [7]:
LA_n = neigh_df[['name', 'latitude', 'longitude']]

In [8]:

LA_center = [34.052235, -118.243683]
map_LA = folium.Map(location=LA_center, zoom_start=10)

#folium.Marker(LA_center, popup='LA_Center').add_to(map_LA)
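# the zip order is intentional: LA_lat holds longitudes and LA_lng holds latitudes (the CSV columns are swapped)
for lon, lat, neighbourhood in zip(LA_lat, LA_lng, LA_neighbourhood):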
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=False,
        fill_color='#3186cc',
        fill_opacity=0.5,
        parse_html=False).add_to(map_LA) 
folium.CircleMarker(LA_center, popup='LA_Center', color = 'orange', fill_color='orange', fill=True, fill_opacity=1.0, radius=6).add_to(map_LA)

map_LA

----------

Querying with the Foursquare API  

Define Foursquare Credentials and Version

In [9]:
CLIENT_ID = '4KA2G0XZSTB0RYQRCEKQQB1SFRLNQKIEY4NJJNLKTXMDTOYS' # your Foursquare ID
CLIENT_SECRET = 'QT0JFD3SUYKQDQMSWDFERRHAH4B4BRXCMEZGHYL2YCEMLZOF' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version

print('Your credentials:')
print('CLIENT_ID: ' + CLIENT_ID)
print('CLIENT_SECRET:' + CLIENT_SECRET)

Your credentials:
CLIENT_ID: 4KA2G0XZSTB0RYQRCEKQQB1SFRLNQKIEY4NJJNLKTXMDTOYS
CLIENT_SECRET:QT0JFD3SUYKQDQMSWDFERRHAH4B4BRXCMEZGHYL2YCEMLZOF
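
Printing credentials in a shared notebook exposes them to every reader. A minimal sketch of an alternative, assuming the keys are stored in environment variables: the variable names `FOURSQUARE_CLIENT_ID` and `FOURSQUARE_CLIENT_SECRET` and both helper functions are hypothetical, not part of the original notebook.

```python
import os


def load_foursquare_credentials(env=os.environ):
    """Read the Foursquare credentials from a mapping (os.environ by default).

    The variable names below are assumptions chosen for this sketch.
    """
    client_id = env.get('FOURSQUARE_CLIENT_ID')
    client_secret = env.get('FOURSQUARE_CLIENT_SECRET')
    if not client_id or not client_secret:
        raise RuntimeError('Set FOURSQUARE_CLIENT_ID and FOURSQUARE_CLIENT_SECRET first')
    return client_id, client_secret


def mask(secret, visible=4):
    """Return the secret with everything after the first few characters hidden."""
    return secret[:visible] + '*' * (len(secret) - visible)
```

With these helpers, `print('CLIENT_ID: ' + mask(CLIENT_ID))` would confirm that a key is loaded without revealing it in the cell output.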


Now, let's get the top 100 venues in Downtown Los Angeles within a radius of 500 meters.

First, let's create the GET request URL and name it `url`.

In [10]:
#neigh_df[['name', 'latitude', 'longitude']]
dwtw_df = neigh_df[neigh_df['name']=='Downtown']


In [11]:
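# the CSV's 'latitude' and 'longitude' columns are swapped, so each value is read from the other column
latitude =  dwtw_df.longitude.values[0]
longitude = dwtw_df.latitude.values[0]
LIMIT = 100   # maximum number of venues to return
radius = 500  # search radius in meters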
url = 'https://api.foursquare.com/v2/venues/explore?client_id={}&client_secret={}&ll={},{}&v={}&radius={}&limit={}'.format(CLIENT_ID, CLIENT_SECRET, latitude, longitude, VERSION, radius, LIMIT)
url

'https://api.foursquare.com/v2/venues/explore?client_id=4KA2G0XZSTB0RYQRCEKQQB1SFRLNQKIEY4NJJNLKTXMDTOYS&client_secret=QT0JFD3SUYKQDQMSWDFERRHAH4B4BRXCMEZGHYL2YCEMLZOF&ll=34.0400086135259,-118.24850990440493&v=20180605&radius=500&limit=100'
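
The same URL can be assembled from a dictionary of query parameters, which keeps each parameter visible by name. The following is only a sketch using the standard library; the function name `build_explore_url` is hypothetical, while the endpoint and parameter names are the ones that appear in the URL above.

```python
from urllib.parse import urlencode

EXPLORE_ENDPOINT = 'https://api.foursquare.com/v2/venues/explore'


def build_explore_url(client_id, client_secret, lat, lng,
                      version='20180605', radius=500, limit=100):
    """Build the explore URL from named query parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'll': '{},{}'.format(lat, lng),  # "latitude,longitude"
        'v': version,
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in the ll value unescaped
    return EXPLORE_ENDPOINT + '?' + urlencode(params, safe=',')


example_url = build_explore_url('ID', 'SECRET', 34.04, -118.25)
```

Naming the parameters also makes it harder to pass latitude and longitude in the wrong order, which matters for this dataset because its coordinate columns are swapped.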

In [12]:
results = requests.get(url).json()
results

{'meta': {'code': 200, 'requestId': '5e4e2edfd03993001b1f265a'},
 'response': {'suggestedFilters': {'header': 'Tap to show:',
   'filters': [{'name': '$-$$$$', 'key': 'price'}]},
  'headerLocation': 'Fashion District',
  'headerFullLocation': 'Fashion District, Los Angeles',
  'headerLocationGranularity': 'neighborhood',
  'totalResults': 21,
  'suggestedBounds': {'ne': {'lat': 34.0445086180259,
    'lng': -118.24308949856285},
   'sw': {'lat': 34.0355086090259, 'lng': -118.253930310247}},
  'groups': [{'type': 'Recommended Places',
    'name': 'recommended',
    'items': [{'reasons': {'count': 0,
       'items': [{'summary': 'This spot is popular',
         'type': 'general',
         'reasonName': 'globalInteractionReason'}]},
      'venue': {'id': '4b4221ddf964a52056cd25e3',
       'name': 'Los Angeles Flower Market',
       'location': {'address': '754 Wall St',
        'crossStreet': 'at 7th St',
        'lat': 34.04047777060825,
        'lng': -118.24986706732301,
        'labele

From the Foursquare lab in the previous module, we know that all the information is in the items key. Before we proceed, let's borrow the get_category_type function from the Foursquare lab.

In [13]:
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
        
    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']

Now we are ready to clean the json and structure it into a pandas dataframe.

In [14]:
venues = results['response']['groups'][0]['items']
    
nearby_venues = json_normalize(venues) # flatten JSON

# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues =nearby_venues.loc[:, filtered_columns]

# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

nearby_venues.head()


Unnamed: 0,name,categories,lat,lng
0,Los Angeles Flower Market,Flower Shop,34.040478,-118.249867
1,Poppy + Rose,Breakfast Spot,34.040565,-118.249943
2,Moskatel's,Arts & Crafts Store,34.040795,-118.248456
3,Sonoratown,Taco Place,34.04168,-118.252095
4,Los Angeles Flower District,Neighborhood,34.039336,-118.249496


And how many venues were returned by Foursquare?

In [15]:
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))

21 venues were returned by Foursquare.


## Explore Neighborhoods in Los Angeles

#### Let's create a function to repeat the same process for all the neighborhoods in Los Angeles

In [16]:
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    
    venues_list=[]
    for name, lat, lng in zip(names, latitudes, longitudes):
        #print(name)
            
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID, 
            CLIENT_SECRET, 
            VERSION, 
            lat, 
            lng, 
            radius, 
            LIMIT)
            
        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']
        
        # return only relevant information for each nearby venue
        venues_list.append([(
            name, 
            lat, 
            lng, 
            v['venue']['name'], 
            v['venue']['location']['lat'], 
            v['venue']['location']['lng'],  
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood', 
                  'Neighborhood Latitude', 
                  'Neighborhood Longitude', 
                  'Venue', 
                  'Venue Latitude', 
                  'Venue Longitude', 
                  'Venue Category']
    
    return(nearby_venues)
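
The list comprehension in `getNearbyVenues` indexes `categories[0]` directly, which would raise an `IndexError` for a venue that has no category. A possible defensive variant of that extraction step is sketched below; `venue_rows` is a hypothetical helper, and the second sample item is invented purely to illustrate the empty-category case.

```python
def venue_rows(neighborhood, lat, lng, items):
    """Turn the 'items' list of one explore response into flat tuples."""
    rows = []
    for item in items:
        venue = item.get('venue', {})
        location = venue.get('location', {})
        categories = venue.get('categories') or []
        # fall back to None when the venue has no category
        category = categories[0]['name'] if categories else None
        rows.append((neighborhood, lat, lng,
                     venue.get('name'),
                     location.get('lat'),
                     location.get('lng'),
                     category))
    return rows


sample_items = [
    {'venue': {'name': 'Poppy + Rose',
               'location': {'lat': 34.040565, 'lng': -118.249943},
               'categories': [{'name': 'Breakfast Spot'}]}},
    # hypothetical venue without a category
    {'venue': {'name': 'Uncategorised venue',
               'location': {'lat': 34.04, 'lng': -118.25},
               'categories': []}},
]
rows = venue_rows('Downtown', 34.040009, -118.248510, sample_items)
```

Rows with a `None` category can then be filtered out or kept, depending on the analysis.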

#### Now write the code to run the above function on each neighborhood and create a new dataframe called *LA_venues*.

In [17]:
# the CSV columns are swapped: LA_lng holds the latitudes and LA_lat the longitudes
LA_venues = getNearbyVenues(names=LA_neighbourhood,
                            latitudes=LA_lng,
                            longitudes=LA_lat)
print('\nFinished.')


Finished.


In [18]:
# Save the results locally so that re-runs do not hit the Foursquare API 'Quota Exceeded' error
LA_venues.to_csv('LA_venues_from_Foursquare.csv')
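
Saving the file only helps if later runs read it back instead of calling the API again. One way this could look, as a sketch: `load_or_fetch` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real `getNearbyVenues` call so that the example runs without network access.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(cache_path, fetch):
    """Return the cached DataFrame if the file exists, otherwise fetch and cache it."""
    if os.path.exists(cache_path):
        # the first CSV column is the index written by to_csv
        return pd.read_csv(cache_path, index_col=0)
    df = fetch()
    df.to_csv(cache_path)
    return df


calls = []


def fake_fetch():
    """Stand-in for the API call; records how often it is invoked."""
    calls.append(1)
    return pd.DataFrame({'Neighborhood': ['Acton'], 'Venue': ['Epik Engineering']})


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'LA_venues_from_Foursquare.csv')
    first = load_or_fetch(cache, fake_fetch)   # fetches and writes the cache
    second = load_or_fetch(cache, fake_fetch)  # reads the cache, no fetch
```

In the notebook the second argument would be something like `lambda: getNearbyVenues(...)`, so the API is queried only when no cached file is present.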

In [19]:
print(LA_venues.shape)
LA_venues.head()

(3020, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
0,Acton,34.497355,-118.16981,Epik Engineering,34.498718,-118.168046,Construction & Landscaping
1,Acton,34.497355,-118.16981,Alma Gardening Co.,34.494762,-118.17255,Construction & Landscaping
2,Adams-Normandie,34.031461,-118.300208,Orange Door Sushi,34.032485,-118.299368,Sushi Restaurant
3,Adams-Normandie,34.031461,-118.300208,Shell,34.033095,-118.300025,Gas Station
4,Adams-Normandie,34.031461,-118.300208,Sushi Delight,34.032445,-118.299525,Sushi Restaurant


Let's check how many venues were returned for each neighborhood

In [20]:
LA_venues.groupby('Neighborhood').count()

Unnamed: 0_level_0,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
Neighborhood,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Acton,2,2,2,2,2,2
Adams-Normandie,8,8,8,8,8,8
Agoura Hills,27,27,27,27,27,27
Agua Dulce,1,1,1,1,1,1
Alhambra,13,13,13,13,13,13
...,...,...,...,...,...,...
Willowbrook,4,4,4,4,4,4
Wilmington,12,12,12,12,12,12
Windsor Square,3,3,3,3,3,3
Winnetka,12,12,12,12,12,12


So 237 of the 272 neighborhoods returned at least one venue.

#### Let's find out how many unique categories can be curated from all the returned venues

In [21]:
print('There are {} unique categories.'.format(len(LA_venues['Venue Category'].unique())))

There are 319 unique categories.


In [22]:
# keep only the venues that Foursquare categorises as coffee shops
LA_venues_coffee = LA_venues[LA_venues['Venue Category']=='Coffee Shop']
print(LA_venues_coffee.shape)
LA_venues_coffee.head()

(89, 7)


Unnamed: 0,Neighborhood,Neighborhood Latitude,Neighborhood Longitude,Venue,Venue Latitude,Venue Longitude,Venue Category
159,Atwater Village,34.131066,-118.262373,Starbucks,34.129278,-118.258659,Coffee Shop
190,Azusa,34.13747,-117.912469,Starbucks,34.13567,-117.9075,Coffee Shop
336,Beverly Grove,34.076633,-118.376102,Starbucks,34.074911,-118.375322,Coffee Shop
396,Koreatown,34.06451,-118.304958,Bia Coffee,34.06358,-118.308221,Coffee Shop
418,Koreatown,34.06451,-118.304958,Starbucks,34.061339,-118.306407,Coffee Shop
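
To compare neighbourhoods it can help to count the coffee shops found in each one. The sketch below rebuilds the five rows shown above as a small stand-in frame (the name `coffee` is hypothetical); in the notebook the same `groupby` could be applied to `LA_venues_coffee`.

```python
import pandas as pd

# stand-in for LA_venues_coffee, using the five rows displayed above
coffee = pd.DataFrame({
    'Neighborhood': ['Atwater Village', 'Azusa', 'Beverly Grove', 'Koreatown', 'Koreatown'],
    'Venue': ['Starbucks', 'Starbucks', 'Starbucks', 'Bia Coffee', 'Starbucks'],
})

# number of coffee shops per neighbourhood, most saturated first
shops_per_neighborhood = (coffee.groupby('Neighborhood')
                                .size()
                                .sort_values(ascending=False))
print(shops_per_neighborhood)
```

Neighbourhoods that are missing from this result returned no coffee shop within the 500 m search radius.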


Let's look at how these coffee shops are distributed across Los Angeles.

In [23]:
LA_center = [34.052235, -118.243683]
map_LA_Coffee = folium.Map(location=LA_center, zoom_start=10)
folium.CircleMarker(LA_center, popup='LA_Center', color = 'green', fill_color='green', fill=True, fill_opacity=0.1, radius=320).add_to(map_LA_Coffee)
folium.CircleMarker(LA_center, popup='LA_Center', color = 'orange', fill_color='orange', fill=True, fill_opacity=0.7, radius=10).add_to(map_LA_Coffee)


# mark each coffee shop; the popup shows the venue name
for lon, lat, venue in zip(LA_venues_coffee['Venue Longitude'].values, LA_venues_coffee['Venue Latitude'].values, LA_venues_coffee['Venue'].values):
    label = '{}'.format(venue)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.5,
        parse_html=False).add_to(map_LA_Coffee) 
    
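# mark each neighbourhood centre; LA_lat holds longitudes and LA_lng holds latitudes (the CSV columns are swapped)
for lon, lat, neighbourhood in zip(LA_lat, LA_lng, LA_neighbourhood):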
    label = '{}'.format(neighbourhood)
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=1,
        popup=label,
        color='red',
        fill=False,
        fill_color='#3186cc',
        fill_opacity=0.5,
        parse_html=False).add_to(map_LA_Coffee) 


map_LA_Coffee

## Distance formula (with Haversine formula)

```python
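from math import radians, sin, cos, asin, sqrt

def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    # convert degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km is the mean Earth radius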
```
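
As an illustration of how the distance function might be used, the sketch below measures how far two of the coffee shops listed earlier are from the map centre. The dictionary `distances` and the labels in `shops` are hypothetical names for this example; the coordinates are copied from the tables above.

```python
from math import radians, sin, cos, asin, sqrt


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))


LA_center = (34.052235, -118.243683)  # (latitude, longitude)

# (label, latitude, longitude) of two coffee shops from the table above
shops = [
    ('Bia Coffee (Koreatown)', 34.06358, -118.308221),
    ('Starbucks (Atwater Village)', 34.129278, -118.258659),
]

# note the argument order: longitude first, then latitude
distances = {
    name: haversine(LA_center[1], LA_center[0], lon, lat)
    for name, lat, lon in shops
}
for name, km in sorted(distances.items(), key=lambda kv: kv[1]):
    print('{}: {:.1f} km from the centre'.format(name, km))
```

The same call could be applied between every neighbourhood centre and every coffee shop to find the neighbourhoods whose nearest coffee shop is farthest away.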

# Impact of Visibility <a id="4"></a>

Let us check whether visibility has any noticeable impact on the occurrence of accidents.

In [24]:
# bins=300
# plt.figure(figsize=(10, 6))

# for st in ['CA', 'TX', 'FL', 'SC', 'NC', 'NY']:    
#     # set s filter
#     stfilt = (df_new['State'] == st)
#     plt.hist(df_new.loc[stfilt,'Visibility(mi)'], bins, density=False)
# plt.xlabel('Visibility(mi)', fontsize=14)
# plt.ylabel('Number of accidents', fontsize=14)
# plt.xlim(0,15)
# plt.grid()
# plt.show()

We can see that the impact of visibility on the number of accidents is not significant. For instance, California ('CA') had the largest number of accidents among these states, yet the visibility in all of these states is almost the same, around 10 mi.

# Monthly Distribution <a id="6"></a>

Let us perform some time-based analysis.

In [25]:
# def which_day(date_time):
#     '''
#     To find out which weekday according to given timestamp with the format 'yyyy-mm-dd hh:mm:ss'
#         input: datetime string with the format of 'yyyy-mm-dd hh:mm:ss'
#         return: nth day of the week
#     '''
#     # import time and date modules
#     from datetime import datetime
#     # import calendar module to extract the exact weekday
#     import calendar
#     try:
#         if type(date_time) is str:
#             my_string=date_time.split(' ')[0]
#             my_date = datetime.strptime(my_string, "%Y-%m-%d")
#             return my_date.weekday()
#         else:
#             raise Exception("'date_time' has unexpected data type, it is expected to be a string")

#     except Exception as e:
#         print(e)
# # use above function to find which weekday 
# nth_day=[]
# date_time=[dt for dt in df_new['Start_Time']]
# for i in range(len(date_time)):
#     nth_day.append(which_day(date_time[i]))
# # add four new columns 'year', 'month', 'hour', 'weekday'
# df_new['year'] = pd.DatetimeIndex(df_new['Start_Time']).year
# df_new['month'] = pd.DatetimeIndex(df_new['Start_Time']).month
# df_new['hour'] = pd.DatetimeIndex(df_new['Start_Time']).hour
# df_new['weekday']=nth_day
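
The loop over `which_day` above could probably be replaced by pandas' vectorised datetime accessors. A minimal sketch, assuming the timestamps follow the 'yyyy-mm-dd hh:mm:ss' format described in the docstring; the two sample timestamps and the names `start_time` and `features` are made up for illustration.

```python
import pandas as pd

# hypothetical sample of the 'Start_Time' column
start_time = pd.Series(['2016-02-08 05:46:00', '2016-02-13 07:23:34'])

parsed = pd.to_datetime(start_time)
features = pd.DataFrame({
    'year': parsed.dt.year,
    'month': parsed.dt.month,
    'hour': parsed.dt.hour,
    'weekday': parsed.dt.weekday,  # Monday = 0 ... Sunday = 6
})
print(features)
```

Parsing the column once and reusing `parsed` avoids converting the same strings four times.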

Let's check the shape of the new dataset.

In [26]:
# df_new.shape

In [27]:
# df_new.loc[:,['year', 'month', 'hour', 'weekday', 'Start_Time']].head()

In [28]:
# df_month=df_new[df_new['year'].isin(['2016','2017', '2018', '2019'])].groupby(['month'], as_index=False).count().iloc[:,:2]
# # by changing the argument in 'isin()' one can look at quite directly the change of the accidents during the years,
# # which I did not do it here.
# df_month.head()

In [29]:
# # plot data in bar chart
# ax=df_month.plot(kind='bar', width=0.8, figsize=(10, 6), legend=None)
# xtick_labels=['Jan.', 'Feb.', 'Mar.', 'Apr.', 'May', 'Jun.', 'Jul.', 'Aug.', 'Sep.', 'Oct.', 'Nov.', 'Dec.']
# ax.set_xticks(list(df_month.index))
# ax.set_xticklabels(xtick_labels)
# ax.set_xlabel('Month', fontsize=14) # add to x-label to the plot
# ax.set_ylabel('Number of Accidents', fontsize=14) # add y-label to the plot
# ax.set_title('Number of accidents by each month', fontsize=14) # add title to the plot
# plt.show()

In [30]:
# wday_filt = (df_new['weekday'].isin([0, 1, 2, 3, 4]))#.to_frame()
# weekend_filt = (df_new['weekday'].isin([5, 6]))#.to_frame()
# df_wday = (df_new.loc[wday_filt])[['hour']]#.count().iloc[:, :2]
# df_weekend = (df_new.loc[weekend_filt])[['hour']]#.count().iloc[:, :2]

# Hourly Distribution <a id="8"></a>

In [31]:
# # plot the distribution of accidents during the day
# fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(6, 12), sharex=True)
# ax0, ax1, ax2 = axes.flatten()
# bins=24
# kwargs = dict(bins=24, density=False, histtype='stepfilled', linewidth=3)
# # ax0
# ax0.hist(list(df_new['hour']),  **kwargs, color='orange', label='Whole week')
# ax0.set_ylabel('Number of accidents', fontsize=14)
# # ax1
# ax1.hist(list(df_wday['hour']), **kwargs, color='blue', label='Work days')
# ax1.set_ylabel('Number of accidents', fontsize=14)
# # ax2
# ax2.hist(list(df_weekend['hour']),  **kwargs, color='Red', label='Only weekend')
# ax2.set_ylabel('Number of accidents', fontsize=14)
# ax2.set_xlabel('Hour', fontsize=14)
# ax0.legend(); ax1.legend(); ax2.legend()
# plt.xlim(0, 23)
# #plt.ylim(0, 2.5e5)
# plt.show()


Most of the accidents happened during the daytime, especially **around rush hours in both the mornings and the afternoons of work days**. At weekends there are relatively fewer accidents, and most of them occurred between **7:00 AM and 9:00 PM**.

# Distribution Along Weekdays <a id="10"></a>

In [32]:
# df_weekday=df_new.groupby(['weekday'], as_index=False).count().iloc[:,:2]
# # set the weekday as the index
# df_weekday.set_index('weekday', inplace=True)

In [33]:
# # plot data in bar chart
# labels = ['Mo', 'Tu', 'We', 'Th', 'Fr', 'Sa', 'Su']
# x = np.arange(len(labels))  # the label locations
# fig, ax = plt.subplots(figsize=(10, 6))
# ax1 = ax.bar(x, df_weekday['ID'], width=0.5)
# #ax1 = ax.plot(x, df_weekday['ID'],marker='o', lw=2)
# # Add some text for labels, title and custom x-axis tick labels, etc.
# ax.set_ylabel('Number of accidents', fontsize=14)
# ax.set_xlabel('Weekday', fontsize=14)
# ax.set_title('Distribution of accidents along the weekdays', fontsize=14)
# ax.set_xticks(x)
# ax.set_xticklabels(labels)

# #df_weekday.plot(kind='line', figsize=(10, 6), legend=None)

# #plt.xlabel('Weekday', fontsize=14) # add to x-label to the plot
# #plt.ylabel('Number of Accidents', fontsize=14) # add y-label to the plot
# #plt.title('Number of accidents by each state', fontsize=14) # add title to the plot
# plt.show()

From the above plot we can clearly see that there are relatively fewer accidents on weekends.

# Analysis on the Recorded Accident Descriptions <a id="12"></a> 

In [34]:
# !pip install Pillow
# !pip install wordcloud

In [35]:
# # join all descriptions from all accidents
# dsc=df_new['Description'].astype(str)
# # remove non-words
# #sanitized_text = ' '.join(re.sub("(@[A-Za-z0-9]+)|([^0-9A-Za-z \t]) |(\w+:\/\/\S+)", " ", text).split()) 
# text = " ".join(desc for desc in dsc)
# print ("There are {} characters in the combination of all descriptions.".format(len(text)))

In [36]:
# more_stopwords=["accident", "due", "blocked", "Right", "hand"]
# for more in more_stopwords:
#     STOPWORDS.add(more)
# # Generate a word cloud image
# # lower max_font_size
# wordcloud = WordCloud(stopwords=STOPWORDS, max_font_size=40, background_color="white").generate(text)
# plt.figure(figsize=(18, 10))
# plt.imshow(wordcloud, interpolation="bilinear")
# plt.axis("off")
# plt.show()
# # Save the image in the img folder:
# wordcloud.to_file("us_accidents_description.png")

From this word cloud one can clearly see that more accidents occur around highways and around the roads of smaller neighborhoods of the USA (the size of each word is directly proportional to its frequency of appearance in the recorded accident descriptions).

# Impact of the Temperature <a id="14"></a>

In [37]:
# df_T=df_new['Temperature(F)'].values

In [38]:
# '''
# # lambda function 
# ftoc=lambda f:5/9*(f-32)
# # function call
# c=[]
# for fi in f:
#     c.append(round(ftoc(fi), 1))
# c=np.array(c)
# c
# '''
# num_bins = 50

# fig = plt.figure(figsize=(10, 6))
# ax1 = fig.add_subplot(111)

# # the histogram of the data
# n, bins, patches = ax1.hist(df_T, num_bins, density=0) # set density=1 to normalize
# # find bincenters
# # bincenters = 0.5*(bins[1:]+bins[:-1])


# ax1.set_xlabel(r"Temperature(°F)", fontsize=14, color='red')
# ax1.set_ylabel('Number of accidents', fontsize=14, color='red')
# ax1.set_xlim(-25, 125) # set xlim 
# # Set the temperature in Celsius
# ax2 = ax1.twiny()
# ax2.set_xlabel(r"Temperature(°C)", fontsize=14, color='red')
# ax2.set_xlim(ax1.get_xlim())
# ax2.set_xticks([-58, -13, 32, 77, 122])
# ax2.set_xticklabels(['-50', '-25', '0','25', '50'])
# plt.grid()
# plt.show()

This histogram tells us that most of the accidents happened when the weather was neither too hot nor too cold to go out.
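
The Celsius labels on the secondary axis are hard-coded in the plotting cell. They could instead be derived from the Fahrenheit tick positions, as in this small sketch; the function name `fahrenheit_to_celsius` and the two list names are hypothetical.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


# Fahrenheit positions used for the secondary axis ticks in the cell above
tick_positions_f = [-58, -13, 32, 77, 122]
tick_labels_c = [round(fahrenheit_to_celsius(f)) for f in tick_positions_f]
print(tick_labels_c)
```

Deriving the labels this way keeps the two axes consistent if the tick positions are changed later.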

# Quick Check the Impacts of other features <a id="16"></a> 

In [39]:
# 100*df.Severity.value_counts()/df.shape[0]

Accidents with severity level 1, which indicates the least impact on traffic (i.e., a short delay as a result of the accident), barely occurred; accidents with severity level 4, which indicates a significant impact on traffic (i.e., a long delay), occurred only a small number of times; accidents with severity levels 2 and 3, which indicate a mid-level impact on traffic, occurred quite frequently.

In [40]:
# df.Stop.value_counts()

The 'Stop' feature has little impact.

In [41]:
# df['Sunrise_Sunset'].value_counts()

'Sunrise_Sunset' does have a considerable impact: around **26.2%** of all accidents occurred during the night.

In [42]:
# df['Traffic_Signal'].value_counts()

'Traffic_Signal' also has some impact: around **16.9%** of all accidents occurred near traffic signal locations.

In [43]:
# df['Give_Way'].value_counts()

'Give_Way', which indicates traffic signs / rules, hardly has any impact on traffic accidents.

# Acknowledgements and References <a id="18"></a> 

I would like to thank the provider of this dataset! This is my very first Kaggle project, even though it involves no prediction or machine-learning work. I am thinking of a way to use classification based on the features that have a noticeable impact on the accidents.

While familiarising myself with this dataset I found two published papers specifically on it. If someone wants more information, please look at the following papers.

References:  
1. arXiv:1906.05409  
2. arXiv:1909.09638


--------------THE END------------