# ____________________________________________________________________

## TASK 1: Data Identification and Collection

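The API I chose is CityBikes, which provides data on bike-sharing networks. At the time of this request it returned 615 networks. Each record describes one network (company, city, country, coordinates), not an individual bike.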

In [125]:
# importing the required libraries
import requests # library to request data from the API
import pandas as pd # library to convert raw data into tabular data (a data frame)
import json # library to work with the JSON data format
import csv # library to work with CSV files

Requesting data from API now:

In [126]:
api = "http://api.citybik.es/v2/networks" # link to get data from API
response = requests.get(api)  # requesting data from link

print(response.status_code) # checking the status of the response from the API

200


The status code is 200, which means the request was successful and the API returned data. Next, the response body is parsed as JSON, because the API only supports the JSON format, and the list stored under the `networks` key is kept.
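A minimal sketch of the check described above, written with the standard library only so that it runs without network access. The function name `parse_networks` and the shortened sample body are assumptions made for illustration; the single record in it is the first network that appears in the outputs of this notebook.

```python
import json

def parse_networks(status_code, body_text):
    """Return the list under the 'networks' key, or raise if the request failed."""
    if status_code != 200:
        raise RuntimeError(f"request failed with status {status_code}")
    payload = json.loads(body_text)
    return payload["networks"]

# Shortened sample body containing a single network (illustrative only)
sample_body = json.dumps({
    "networks": [
        {
            "company": ["ЗАО «СитиБайк»"],
            "href": "/v2/networks/velobike-moscow",
            "id": "velobike-moscow",
            "location": {"city": "Moscow", "country": "RU",
                         "latitude": 55.75, "longitude": 37.616667},
            "name": "Velobike",
        }
    ]
})

networks = parse_networks(200, sample_body)
print(len(networks), networks[0]["location"]["city"])
```

With `requests`, a similar guard is available as `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.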

In [127]:
data = response.json()['networks'] # storing the list of networks from the API response in a variable


In [128]:
df = pd.DataFrame(data) # converting the data into a data frame

print(type(df))
print(len(df))

<class 'pandas.core.frame.DataFrame'>
615


Now checking the details of the data frame: the data type of each feature and its count of non-null values. From this we can see whether the data set is complete or whether any feature has missing values.

In [129]:
df.info() # lists every feature with its data type and non-null count, which shows whether the data set is complete

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 615 entries, 0 to 614
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   company    603 non-null    object
 1   href       615 non-null    object
 2   id         615 non-null    object
 3   location   615 non-null    object
 4   name       615 non-null    object
 5   source     158 non-null    object
 6   gbfs_href  105 non-null    object
 7   license    27 non-null     object
 8   ebikes     24 non-null     object
dtypes: object(9)
memory usage: 43.4+ KB


In [130]:
df.isnull().sum() # shows how many null values there are in each column

company       12
href           0
id             0
location       0
name           0
source       457
gbfs_href    510
license      588
ebikes       591
dtype: int64
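To make these counts easier to compare, the short sketch below converts them into percentages of the 615 rows. The numbers are copied from the output above; the variable names are my own and are not used elsewhere in the notebook.

```python
total_rows = 615  # number of rows reported by df.info()

# null counts copied from the df.isnull().sum() output
null_counts = {
    "company": 12,
    "href": 0,
    "id": 0,
    "location": 0,
    "name": 0,
    "source": 457,
    "gbfs_href": 510,
    "license": 588,
    "ebikes": 591,
}

# share of missing values per column, in percent, rounded to one decimal place
missing_share = {col: round(100 * n / total_rows, 1) for col, n in null_counts.items()}

for col, share in missing_share.items():
    print(f"{col:<10} {share:5.1f} % missing")
```

Roughly three quarters of the `source` values and more than 95 % of the `license` and `ebikes` values are missing, which is worth keeping in mind for the pre-processing step.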

In [131]:
df.describe() # summarises each feature: count, number of unique values, most frequent value and its frequency

Unnamed: 0,company,href,id,location,name,source,gbfs_href,license,ebikes
count,603,615,615,615,615,158,105.0,27,24
unique,86,615,615,612,407,136,104.0,5,1
top,[Nextbike GmbH],/v2/networks/velobike-moscow,velobike-moscow,"{'city': 'Poznań', 'country': 'PL', 'latitude'...",Nextbike,https://developer.jcdecaux.com,,"{'name': 'Open Licence', 'url': 'https://devel...",True
freq,253,1,1,2,59,23,2.0,23,24


The results above show that the data set contains 615 networks. The company column has 86 unique values, so 86 distinct operators (or groups of operators) appear in the data; the most frequent one, Nextbike GmbH, accounts for 253 networks. Almost every network has a unique location (612 of 615), because the latitude and longitude differ from network to network.

In [132]:
df.head() # showing the first five rows of the data frame

Unnamed: 0,company,href,id,location,name,source,gbfs_href,license,ebikes
0,[ЗАО «СитиБайк»],/v2/networks/velobike-moscow,velobike-moscow,"{'city': 'Moscow', 'country': 'RU', 'latitude'...",Velobike,,,,
1,[Urban Infrastructure Partner],/v2/networks/baerum-bysykkel,baerum-bysykkel,"{'city': 'Bærum', 'country': 'NO', 'latitude':...",Bysykkel,,,,
2,[Comunicare S.r.l.],/v2/networks/bicincitta-siena,bicincitta-siena,"{'city': 'Siena', 'country': 'IT', 'latitude':...",Bicincittà,https://www.bicincitta.com/frmLeStazioni.aspx?...,,,
3,[Cyclopolis Systems],/v2/networks/cyclopolis-maroussi,cyclopolis-maroussi,"{'city': 'Maroussi', 'country': 'GR', 'latitud...",Cyclopolis,,,,
4,"[Groundwork, Slough Borough Council, ITS]",/v2/networks/cycle-hire-slough,cycle-hire-slough,"{'city': 'Slough', 'country': 'GB', 'latitude'...",Cycle Hire,,,,


Now separating the keys and values of the first record in our JSON data:

In [133]:
features = df.iloc[0,:].index # getting the attributes or headers
print(features) # attributes

Index(['company', 'href', 'id', 'location', 'name', 'source', 'gbfs_href',
       'license', 'ebikes'],
      dtype='object')


In [134]:
valuess = df.iloc[0,:].values # getting the values under the headers
print(valuess) # values

[list(['ЗАО «СитиБайк»']) '/v2/networks/velobike-moscow' 'velobike-moscow'
 {'city': 'Moscow', 'country': 'RU', 'latitude': 55.75, 'longitude': 37.616667}
 'Velobike' nan nan nan nan]


The keys (column names) of the data are now known. Two of the 9 columns, `location` and `license`, hold another dictionary as their value, and `company` holds a list; the remaining columns hold a single value.
Here, I opened those dictionaries and stored their keys as separate variables: four for `location` (city, country, longitude, latitude) and two for `license` (name, url). Each row is then rebuilt from the other seven columns plus these six new variables, which expands the final output to 13 columns.

In [135]:
final_list = [] # list to store one flattened row per network

for d in range(len(df)):
    # defaults used when the location or license dictionary is missing
    location_city = ""
    location_country = ""
    location_longitude = 0
    location_latitude = 0
    license_name = ""
    license_url = ""

    row = df.iloc[d, :].values # values of row d, in column order

    # storing each value of the row in a variable
    company = str(row[0]) # the list of companies is kept as its string form
    href = row[1]
    network_id = row[2] # not named `id`, to avoid shadowing the built-in function
    location = row[3]
    name = row[4]
    source = row[5]
    gbfs_href = row[6]
    license_info = row[7] # not named `license`, to avoid shadowing the built-in
    ebikes = row[8]

    # variables for the location column (a dictionary in the obtained data)
    if isinstance(location, dict): # skipped when the value is missing (NaN)
        location_city = location.get("city")
        location_country = location.get("country")
        location_longitude = location.get("longitude") # float
        location_latitude = location.get("latitude") # float

    # variables for the license column (a dictionary in the obtained data)
    if isinstance(license_info, dict): # skipped when the value is missing (NaN)
        license_name = license_info.get("name")
        license_url = license_info.get("url")

    # building the flattened row in the column order of the final csv file
    flat_row = [company, href, network_id,
                location_city, location_country, location_longitude, location_latitude,
                name, source, gbfs_href, license_name, license_url, ebikes]

    final_list.append(flat_row) # each entry of final_list is now the complete data of one network

print(final_list[0])

["['ЗАО «СитиБайк»']", '/v2/networks/velobike-moscow', 'velobike-moscow', 'Moscow', 'RU', 37.616667, 55.75, 'Velobike', nan, nan, '', '', nan]
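As a possible alternative to unpacking the dictionaries by hand, pandas provides `pd.json_normalize`, which flattens nested dictionaries in a list of records into separate columns. The sketch below is not the method used in this notebook. It runs on two sample records: the first is the Moscow network from the output above, and the second is a hypothetical record invented only to show a license being unpacked.

```python
import pandas as pd

records = [
    {   # first network from the API output
        "company": ["ЗАО «СитиБайк»"],
        "href": "/v2/networks/velobike-moscow",
        "id": "velobike-moscow",
        "location": {"city": "Moscow", "country": "RU",
                     "latitude": 55.75, "longitude": 37.616667},
        "name": "Velobike",
    },
    {   # hypothetical record, invented for illustration only
        "company": ["Example Operator"],
        "href": "/v2/networks/example-network",
        "id": "example-network",
        "location": {"city": "Example City", "country": "XX",
                     "latitude": 0.0, "longitude": 0.0},
        "name": "Example",
        "license": {"name": "Example Licence",
                    "url": "https://example.com/licence"},
    },
]

# sep="_" names the new columns location_city, license_name, and so on
flat = pd.json_normalize(records, sep="_")

print(sorted(flat.columns))
print(flat[["id", "location_city", "license_name"]])
```

One difference from the loop: a network without a license gets NaN in `license_name` and `license_url` here, rather than an empty string.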


In [136]:
# defining the list of headers we want in the final csv file
headings = ["Company", "href", "ID", "Location_City", "Location_Country", "Location_Longitude", "Location_Latitude", "Name", "Source", "gbfs_href", "License_name", "License_url", "ebikes"]
# file name | write mode | encoding standard = utf-8 | the newline parameter avoids blank lines between rows
with open('bike.csv', 'w', encoding="utf-8", newline='') as f:
    writer = csv.writer(f)

    writer.writerow(headings) # writing the header row

    writer.writerows(final_list) # writing all 615 networks under their respective headings

The data is now saved in the csv file. The next Python notebook will load this file and do the pre-processing.
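Before moving on, it can be useful to confirm that a row written this way reads back under the right headings. The sketch below does a round trip in memory with `io.StringIO` instead of touching `bike.csv`. The sample row is the first flattened network, with its missing values written as empty strings for this illustration.

```python
import csv
import io

headings = ["Company", "href", "ID", "Location_City", "Location_Country",
            "Location_Longitude", "Location_Latitude", "Name", "Source",
            "gbfs_href", "License_name", "License_url", "ebikes"]

# first flattened row; missing values are empty strings in this sketch
row = ["['ЗАО «СитиБайк»']", "/v2/networks/velobike-moscow", "velobike-moscow",
       "Moscow", "RU", 37.616667, 55.75, "Velobike", "", "", "", "", ""]

# write the header and the row into an in-memory text buffer
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(headings)
writer.writerow(row)

# read the buffer back, mapping each value to its heading
buffer.seek(0)
rows_read = list(csv.DictReader(buffer))

print(rows_read[0]["Location_City"], rows_read[0]["Location_Longitude"])
```

Note that the `csv` module returns every value as a string, so the longitude and latitude need to be converted back to numbers when the file is loaded.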

### Challenges faced in collecting data:

Collecting the data was easy because everything was in JSON format, but the values in the columns had different data types. Some features held a nested dictionary or a list instead of a single value, and this was the challenging part. To make the data set easier to visualize, I expanded those nested values into the main data set. For example, the 'Location' feature held a dictionary, so I expanded it into 4 values (city, country, longitude, latitude) and integrated them into the main data set. The same was done for the 'License' feature, which was expanded into 2 values (name, url). This method increased the number of features or columns, but the data became easier to read and I was able to extract the information easily.