# Collecting Data

So far, we have been using pandas to work with data. But have you ever wondered how (or from where) we got the data in the first place? In this chapter, we will discuss how to collect data. Let's get started . . .

There are 3 major ways of getting data. We will discuss them one by one.

## 1. Downloading data

The first and most obvious way is to download it. Many websites let you download data for free.
- There are many online dataset repositories where you can download datasets for free.
    - [Kaggle](https://www.kaggle.com)
    - [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php)
    - [Google Dataset Search](https://datasetsearch.research.google.com/)
    - [Registry of Open Data on AWS](https://registry.opendata.aws/)
- Many organizations and sites offer their data in processed form. For example, IMDb lets you download a subset of its database from [here](https://www.imdb.com/interfaces/).
- Many colleges and universities publish the data they used for research. Anyone can download it for educational purposes.
- And finally, you can also download your personal data from sites like Netflix, Google, etc. and analyze it. How cool is that!

Many of these datasets are in **CSV** or **TSV** format. You can download them by following the instructions on the respective sites, and then read them using pandas.
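As a minimal sketch of that last step, the snippet below reads comma-separated and tab-separated text with `pd.read_csv`. The inline strings are made-up stand-ins for downloaded files; with a real download you would pass the file path instead.

```python
import io

import pandas as pd

# Made-up stand-ins for the contents of a downloaded .csv and .tsv file
csv_text = "name,species\nRick Sanchez,Human\nMorty Smith,Human\n"
tsv_text = "name\tspecies\nRick Sanchez\tHuman\nMorty Smith\tHuman\n"

# read_csv splits on commas by default
csv_df = pd.read_csv(io.StringIO(csv_text))

# For TSV files, tell read_csv that the separator is a tab
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both frames hold the same rows and columns
print(csv_df.equals(tsv_df))
```

The only difference between the two calls is the `sep` argument.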

Long story short, there are many resources for downloading datasets, so you often don't have to build your own. Finally, here is a great repository for finding a lot more datasets. It's called [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets).

## 2. Application Programming Interfaces (APIs)
Some website providers offer Application Programming Interfaces (APIs) that allow you to access data in a predefined manner. With APIs, you can access the data in formats like JSON and XML. APIs are a great way of accessing and downloading data. Many social media and streaming sites, such as Twitter, Facebook, Instagram, and Spotify, provide APIs for accessing data.

Here is a very simple API that will help you understand APIs better. It's called **The Rick and Morty API** and it's based on the famous "Rick and Morty" show.

Some of our mentors at **aiadventures** love this show. If you are also a fan, then make sure to get in touch😁

## Rick & Morty API

The first step is to read the [documentation](https://rickandmortyapi.com/documentation) to understand how to use the API and what data is present in the response.

Try pasting the URL `https://rickandmortyapi.com/api` into your browser. What do you see?

What you see in your browser is called the **response**. In fact, every page that you see in your browser is a response from the server. 

Coming back, our response contains information about all of the API's available resources. All requests (in this API) are *GET* requests and go over *HTTPS*, and all responses return data in *JSON*.

```{note} 
If you are hearing about HTTP requests, responses, JSON, etc. for the first time, then we recommend that you google them and spend some time reading about them.
```
Now let's use the `requests` library to get the response programmatically. `requests` is a popular Python library for making HTTP requests. Among other things, it can tell you whether the page at a given URL exists on the server and, if it does, give you the content of that page.

In [94]:
import requests
import pandas as pd
response = requests.get('https://rickandmortyapi.com/api')

The response from the server is saved in the `response` variable. The `response` object has many attributes.

To check the response status, you can say . . .

In [95]:
response.status_code

200

*200* means the request was successfully served. Codes like this are called **HTTP response status codes**. You can read more about them [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).
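To get a feel for a few common codes without making any request, here is a small sketch using `HTTPStatus` from Python's standard library. The helper `is_success` is a hypothetical name introduced only for this illustration.

```python
from http import HTTPStatus

# Look up the standard phrase for a few well-known status codes
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase)


def is_success(code):
    """Return True for codes in the 2xx (success) range."""
    return 200 <= code < 300


print(is_success(200), is_success(404))
```

Codes in the 2xx range indicate success, 4xx codes indicate a problem with the request, and 5xx codes indicate a problem on the server.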

Let's look at the response text . . .

In [96]:
response.text

'{"characters":"https://rickandmortyapi.com/api/character","locations":"https://rickandmortyapi.com/api/location","episodes":"https://rickandmortyapi.com/api/episode"}'

This is exactly the same thing you saw in your browser. 

Note that this looks like a dictionary but is actually a string (the whole thing is enclosed in single quotes, `''`). To get the parsed JSON (equivalent to a Python dictionary), we can use the `.json()` method.

In [97]:
data = response.json()
data

{'characters': 'https://rickandmortyapi.com/api/character',
 'locations': 'https://rickandmortyapi.com/api/location',
 'episodes': 'https://rickandmortyapi.com/api/episode'}

If you check the type of the `data` variable, it's a Python dictionary.

In [98]:
type(data)

dict
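To make the string-versus-dictionary distinction concrete, the sketch below parses the same response text by hand with the standard `json` module. For a JSON object like this one, the result should match what `.json()` returned; the variable names here are chosen just for the illustration.

```python
import json

# The raw response body, copied from the `response.text` output above
text = (
    '{"characters":"https://rickandmortyapi.com/api/character",'
    '"locations":"https://rickandmortyapi.com/api/location",'
    '"episodes":"https://rickandmortyapi.com/api/episode"}'
)

# json.loads turns a JSON string into Python objects (here, a dict)
parsed = json.loads(text)

print(type(text).__name__, type(parsed).__name__)
print(parsed["characters"])
```

A string can only be indexed by position, while the parsed dictionary can be indexed by key.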

Now we can access different values using dict keys, just like we normally do.

In [99]:
data['characters']

'https://rickandmortyapi.com/api/character'

Let's download the characters data. Since we are again asking the server for data (this time the `characters` resource of the API), we use the `requests.get()` function once more.

In [100]:
character_resp = requests.get(data['characters'])
character_resp.status_code

200

```{tip}
Check the response status code whenever you make a request to the server.
```
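One way to make this check a habit is to wrap it in a small helper. The sketch below is only an illustration: `fetch_json`, `FakeResponse`, and `fake_get` are hypothetical names, and the fake objects stand in for the server so the example runs offline.

```python
class FakeResponse:
    """Hypothetical stand-in with the two response features used below."""

    def __init__(self, status_code, payload):
        self.status_code = status_code
        self._payload = payload

    def json(self):
        return self._payload


def fetch_json(url, get):
    """Call get(url) and return the parsed JSON, failing loudly on a non-200 status."""
    response = get(url)
    if response.status_code != 200:
        raise RuntimeError(f"GET {url} failed with status {response.status_code}")
    return response.json()


def fake_get(url):
    # Pretend that only the character endpoint exists
    if url.endswith("/character"):
        return FakeResponse(200, {"info": {}, "results": []})
    return FakeResponse(404, {"error": "not found"})


print(fetch_json("https://rickandmortyapi.com/api/character", fake_get))
```

With the real library you would pass `requests.get` as the `get` argument. `requests` also offers `response.raise_for_status()`, which raises an exception for 4xx and 5xx responses.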

In [101]:
character_data = character_resp.json()

As we have already seen, `.json()` returns a Python dictionary. So, we can call the `.keys()` method to look at all the dict keys.

In [102]:
character_data.keys()

dict_keys(['info', 'results'])

We have two keys. Let's look at them one by one . . .

In [103]:
character_data['info']

{'count': 826,
 'pages': 42,
 'next': 'https://rickandmortyapi.com/api/character?page=2',
 'prev': None}

In [104]:
var1=character_data['results'][:20]
var1

[{'id': 1,
  'name': 'Rick Sanchez',
  'status': 'Alive',
  'species': 'Human',
  'type': '',
  'gender': 'Male',
  'origin': {'name': 'Earth (C-137)',
   'url': 'https://rickandmortyapi.com/api/location/1'},
  'location': {'name': 'Citadel of Ricks',
   'url': 'https://rickandmortyapi.com/api/location/3'},
  'image': 'https://rickandmortyapi.com/api/character/avatar/1.jpeg',
  'episode': ['https://rickandmortyapi.com/api/episode/1',
   'https://rickandmortyapi.com/api/episode/2',
   'https://rickandmortyapi.com/api/episode/3',
   'https://rickandmortyapi.com/api/episode/4',
   'https://rickandmortyapi.com/api/episode/5',
   'https://rickandmortyapi.com/api/episode/6',
   'https://rickandmortyapi.com/api/episode/7',
   'https://rickandmortyapi.com/api/episode/8',
   'https://rickandmortyapi.com/api/episode/9',
   'https://rickandmortyapi.com/api/episode/10',
   'https://rickandmortyapi.com/api/episode/11',
   'https://rickandmortyapi.com/api/episode/12',
   'https://rickandmortyapi.com

Let's convert this data into a pandas dataframe.

In [105]:
import pandas as pd

character_df = pd.DataFrame(character_data['results'])
character_df.head()

|  | id | name | status | species | type | gender | origin | location | image | episode | url | created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Rick Sanchez | Alive | Human |  | Male | {'name': 'Earth (C-137)', 'url': 'https://rick... | {'name': 'Citadel of Ricks', 'url': 'https://r... | https://rickandmortyapi.com/api/character/avat... | [https://rickandmortyapi.com/api/episode/1, ht... | https://rickandmortyapi.com/api/character/1 | 2017-11-04T18:48:46.250Z |
| 1 | 2 | Morty Smith | Alive | Human |  | Male | {'name': 'unknown', 'url': ''} | {'name': 'Citadel of Ricks', 'url': 'https://r... | https://rickandmortyapi.com/api/character/avat... | [https://rickandmortyapi.com/api/episode/1, ht... | https://rickandmortyapi.com/api/character/2 | 2017-11-04T18:50:21.651Z |
| 2 | 3 | Summer Smith | Alive | Human |  | Female | {'name': 'Earth (Replacement Dimension)', 'url... | {'name': 'Earth (Replacement Dimension)', 'url... | https://rickandmortyapi.com/api/character/avat... | [https://rickandmortyapi.com/api/episode/6, ht... | https://rickandmortyapi.com/api/character/3 | 2017-11-04T19:09:56.428Z |
| 3 | 4 | Beth Smith | Alive | Human |  | Female | {'name': 'Earth (Replacement Dimension)', 'url... | {'name': 'Earth (Replacement Dimension)', 'url... | https://rickandmortyapi.com/api/character/avat... | [https://rickandmortyapi.com/api/episode/6, ht... | https://rickandmortyapi.com/api/character/4 | 2017-11-04T19:22:43.665Z |
| 4 | 5 | Jerry Smith | Alive | Human |  | Male | {'name': 'Earth (Replacement Dimension)', 'url... | {'name': 'Earth (Replacement Dimension)', 'url... | https://rickandmortyapi.com/api/character/avat... | [https://rickandmortyapi.com/api/episode/6, ht... | https://rickandmortyapi.com/api/character/5 | 2017-11-04T19:26:56.301Z |
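Notice that the `origin` and `location` columns hold whole dictionaries. If you would rather have one column per nested field, `pd.json_normalize` is one option. The sketch below uses two records copied from the output above, trimmed to a few fields to keep it short.

```python
import pandas as pd

# Two records from the characters response, trimmed to a few fields
records = [
    {
        "id": 1,
        "name": "Rick Sanchez",
        "origin": {
            "name": "Earth (C-137)",
            "url": "https://rickandmortyapi.com/api/location/1",
        },
    },
    {
        "id": 2,
        "name": "Morty Smith",
        "origin": {"name": "unknown", "url": ""},
    },
]

# Nested dictionaries become dotted column names such as "origin.name"
flat_df = pd.json_normalize(records)
print(flat_df.columns.tolist())
print(flat_df[["name", "origin.name"]])
```

The same call should work on the full `character_data['results']` list, since each record there has the same nested shape.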


Similarly, let's download the locations and episodes data.

In [106]:
## Locations
location_resp = requests.get(data['locations'])
location_data = location_resp.json()
location_df = pd.DataFrame(location_data['results'])
location_df.head()

|  | id | name | type | dimension | residents | url | created |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Earth (C-137) | Planet | Dimension C-137 | [https://rickandmortyapi.com/api/character/38,... | https://rickandmortyapi.com/api/location/1 | 2017-11-10T12:42:04.162Z |
| 1 | 2 | Abadango | Cluster | unknown | [https://rickandmortyapi.com/api/character/6] | https://rickandmortyapi.com/api/location/2 | 2017-11-10T13:06:38.182Z |
| 2 | 3 | Citadel of Ricks | Space station | unknown | [https://rickandmortyapi.com/api/character/8, ... | https://rickandmortyapi.com/api/location/3 | 2017-11-10T13:08:13.191Z |
| 3 | 4 | Worldender's lair | Planet | unknown | [https://rickandmortyapi.com/api/character/10,... | https://rickandmortyapi.com/api/location/4 | 2017-11-10T13:08:20.569Z |
| 4 | 5 | Anatomy Park | Microverse | Dimension C-137 | [https://rickandmortyapi.com/api/character/12,... | https://rickandmortyapi.com/api/location/5 | 2017-11-10T13:08:46.060Z |


In [107]:
## Episodes
episode_resp = requests.get(data['episodes'])
episode_data = episode_resp.json()
episode_df = pd.DataFrame(episode_data['results'])
episode_df.head()

|  | id | name | air_date | episode | characters | url | created |
|---|---|---|---|---|---|---|---|
| 0 | 1 | Pilot | December 2, 2013 | S01E01 | [https://rickandmortyapi.com/api/character/1, ... | https://rickandmortyapi.com/api/episode/1 | 2017-11-10T12:56:33.798Z |
| 1 | 2 | Lawnmower Dog | December 9, 2013 | S01E02 | [https://rickandmortyapi.com/api/character/1, ... | https://rickandmortyapi.com/api/episode/2 | 2017-11-10T12:56:33.916Z |
| 2 | 3 | Anatomy Park | December 16, 2013 | S01E03 | [https://rickandmortyapi.com/api/character/1, ... | https://rickandmortyapi.com/api/episode/3 | 2017-11-10T12:56:34.022Z |
| 3 | 4 | M. Night Shaym-Aliens! | January 13, 2014 | S01E04 | [https://rickandmortyapi.com/api/character/1, ... | https://rickandmortyapi.com/api/episode/4 | 2017-11-10T12:56:34.129Z |
| 4 | 5 | Meeseeks and Destroy | January 20, 2014 | S01E05 | [https://rickandmortyapi.com/api/character/1, ... | https://rickandmortyapi.com/api/episode/5 | 2017-11-10T12:56:34.236Z |


Great, we have downloaded the first page (20 records) of each resource!
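The `info` block shown earlier reports 42 pages and a `next` URL, so a single request returns only one page of results. A minimal sketch of following those `next` links is given below. The function name `fetch_all_results` and the tiny `FAKE_PAGES` dictionary are made up for this illustration, so that the loop can run without network access.

```python
def fetch_all_results(first_url, fetch_page):
    """Follow info['next'] links until there are none, collecting every result."""
    results = []
    url = first_url
    while url:
        page = fetch_page(url)
        results.extend(page["results"])
        url = page["info"]["next"]  # None on the last page, which ends the loop
    return results


# Offline stand-in for the API: two tiny made-up pages keyed by a fake URL
FAKE_PAGES = {
    "page1": {"info": {"next": "page2"}, "results": [{"id": 1}, {"id": 2}]},
    "page2": {"info": {"next": None}, "results": [{"id": 3}]},
}

all_results = fetch_all_results("page1", FAKE_PAGES.__getitem__)
print(len(all_results))
```

Against the real API, you could pass `'https://rickandmortyapi.com/api/character'` as `first_url` and `lambda url: requests.get(url).json()` as `fetch_page`. That makes one request per page, so be considerate about how often you run it.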

Oftentimes, you will have to do some preprocessing before you can use your data. We can skip preprocessing here because this data is pretty clean.

## Exercise

Now that we have our data, try answering the following questions based on it.

- List the names of all the episodes released (`air_date`) in 2014.
- What is the origin location of `Rick Sanchez`?
- List the names of all the episodes released in or after 2014, in alphabetical order.
- List the names and release years of all the episodes where `Summer Smith` is present, in chronological order.
- In how many episodes did `Jerry Smith` appear?
- List all characters created in 2017, in descending order by `species`. For characters with the same species, order them alphabetically by their `name`.
- List the names of all characters present in `S01E05`.
- List the names of all characters present in an episode released in 2014.
- Name the last location of `Annie`, from the most recent episode.
- List the names of all episodes in which both `Jerry Smith` and `Summer Smith` appeared.
- List the names of all characters who appeared in an episode in which `Amish Cyborg` also appeared.
- List the ids of the top 5 characters by number of episodes, starting from the highest.
- What is the name of the episode with the maximum number of characters?

Try to answer all these questions. We are pretty sure that you will enjoy solving them.

Discuss all your solutions with your mentors, see if you got all of them right!
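As a hint for the questions that link episodes to characters: the `characters` column holds lists of URLs, and the character id is the last part of each URL. The sketch below shows one possible way to unpack such a column. It uses a hypothetical miniature DataFrame with made-up episode names and character lists, not the real data.

```python
import pandas as pd

# Hypothetical miniature of episode_df (made-up rows, for illustration only)
mini_episode_df = pd.DataFrame(
    {
        "name": ["Episode A", "Episode B"],
        "characters": [
            [
                "https://rickandmortyapi.com/api/character/1",
                "https://rickandmortyapi.com/api/character/2",
            ],
            ["https://rickandmortyapi.com/api/character/1"],
        ],
    }
)

# explode gives one row per (episode, character URL) pair
pairs = mini_episode_df.explode("characters").reset_index(drop=True)

# The character id is the last path segment of the URL
pairs["character_id"] = (
    pairs["characters"].str.rsplit("/", n=1).str[-1].astype(int)
)

print(pairs[["name", "character_id"]])
```

Once the ids are in their own column, the result can be merged with a characters table on `id`.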

In [109]:
date=pd.to_datetime(episode_df["created"])
date

0    2017-11-10 12:56:33.798000+00:00
1    2017-11-10 12:56:33.916000+00:00
2    2017-11-10 12:56:34.022000+00:00
3    2017-11-10 12:56:34.129000+00:00
4    2017-11-10 12:56:34.236000+00:00
5    2017-11-10 12:56:34.339000+00:00
6    2017-11-10 12:56:34.441000+00:00
7    2017-11-10 12:56:34.543000+00:00
8    2017-11-10 12:56:34.645000+00:00
9    2017-11-10 12:56:34.747000+00:00
10   2017-11-10 12:56:34.850000+00:00
11   2017-11-10 12:56:34.953000+00:00
12   2017-11-10 12:56:35.055000+00:00
13   2017-11-10 12:56:35.158000+00:00
14   2017-11-10 12:56:35.261000+00:00
15   2017-11-10 12:56:35.364000+00:00
16   2017-11-10 12:56:35.467000+00:00
17   2017-11-10 12:56:35.569000+00:00
18   2017-11-10 12:56:35.669000+00:00
19   2017-11-10 12:56:35.772000+00:00
Name: created, dtype: datetime64[ns, UTC]

In [None]:
# Keep the episodes whose `created` timestamp falls in 2017
in_2017 = date.dt.year == 2017
episode_df[in_2017]

In [None]:
character_df.iloc[:6]

## 3. Web Scraping

The third and last way of collecting data is web scraping. The web is one of the largest sources of information available, so the ability to scrape data from it is valuable in many fields of research.

This becomes even more important in fields (like Data Science, Machine Learning, and Deep Learning) where the amount of data strongly affects the quality of both products and research.

**Web scraping** allows us to extract, parse, download, and organize useful information from the web automatically.
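To give a flavor of the "extract and parse" part, here is a tiny sketch that collects all link targets from a piece of HTML. It uses `HTMLParser` from Python's standard library rather than a dedicated scraping library, and the HTML string is a made-up stand-in for a downloaded page. The class name `LinkCollector` is hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Made-up page content standing in for a downloaded web page
html = """
<html><body>
  <h1>Datasets</h1>
  <a href="https://www.kaggle.com">Kaggle</a>
  <a href="https://registry.opendata.aws/">AWS Open Data</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

In a real scraper, the HTML would come from a request to the page, and a dedicated parsing library usually makes the extraction step more convenient.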

Discussing web scraping in depth is beyond the scope of this course. But you can read this [wonderful blog](https://www.kaggle.com/ankursingh12/create-your-own-dataset-with-beautifulsoup) by our teammate [Ankur](https://www.linkedin.com/in/ankur-singh-ml/), to get started with web scraping.

The blog discusses the following:

👉 Basics of web scraping.

👉 How you can create your own dataset for data science.

👉 Some exercises for practice.

## Conclusion

These are some of the ways of collecting data. As a Data Scientist, you should know (almost) everything! Web scraping, APIs, bash, pandas, numpy, data visualization, machine learning, databases and so much more . . . 

### Questionnaire

1. What are the different ways to collect data from the internet?
2. How does an API work?
3. Is web scraping legal? How can we check whether we are allowed to scrape data from a certain website?
4. On which types of URLs can web scraping be done?