## Types of data - structured and unstructured

Reading data is the first step to extract information from it. Data can exist broadly in two formats:

(1) Structured data, and
(2) Unstructured data.

Structured data is typically stored in a tabular form, where rows in the data correspond to "observations" and columns correspond to "variables". For example, the following dataset contains 5 observations, where each observation (or row) consists of information about a movie. The variables (or columns) contain different pieces of information about a given movie. As all variables for a given row are related to the same movie, the data below is also called relational data.

In [15]:
#| echo: false
import pandas as pd
data = pd.read_csv('movies_sample_data.csv')
data.head()

Unnamed: 0,Title,US Gross,Production Budget,Release Date,Major Genre,Creative Type,Rotten Tomatoes Rating,IMDB Rating
0,The Shawshank Redemption,28241469,25000000,Sep 23 1994,Drama,Fiction,88,9.2
1,Inception,285630280,160000000,Jul 16 2010,Horror/Thriller,Fiction,87,9.1
2,One Flew Over the Cuckoo's Nest,108981275,4400000,Nov 19 1975,Comedy,Fiction,96,8.9
3,The Dark Knight,533345358,185000000,Jul 18 2008,Action/Adventure,Fiction,93,8.9
4,Schindler's List,96067179,25000000,Dec 15 1993,Drama,Non-Fiction,97,8.9


Unstructured data is data that is not organized in any pre-defined manner. Examples of unstructured data include text files, audio/video files, images, Internet of Things (IoT) data, etc. Unstructured data is relatively harder to analyze, as most analytical methods and tools are oriented towards structured data. However, unstructured data can be used to obtain structured data, which in turn can be analyzed. For example, an image can be converted to an array of pixels, which is structured data. Machine learning algorithms can then be used on the array to classify the image as that of a dog or a cat.
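
As a rough illustration of the image example, the sketch below treats a tiny made-up grayscale image as an array of pixel intensities. The pixel values are hypothetical, and *NumPy* (which is installed alongside *Pandas*) is assumed to be available.

```python
import numpy as np

# A hypothetical 2x3 grayscale "image": each value is a pixel intensity (0-255)
image = np.array([[0, 128, 255],
                  [64, 192, 32]])

print(image.shape)      # (rows, columns) of pixels

# Flattening gives one row of 6 pixel values, i.e., 6 "variables" for this image
flat = image.flatten()
print(flat)
```

Stacking one such row per image would give a table with one observation per image.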

In this course, we will focus on analyzing structured data.

## Reading a *csv* file with *Pandas*

Structured data can be stored in a variety of formats. The most popular format is *csv* (for example, *data_file_name.csv*), where the extension *csv* stands for comma-separated values. The variable values of each observation are separated by a comma in a *.csv* file. In other words, the **delimiter** is a comma in a *csv* file. However, the comma is not visible when a *.csv* file is opened with Microsoft Excel, as Excel displays the values in separate cells.
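
To make the role of the delimiter concrete, here is a minimal sketch that splits the raw text of a tiny, made-up *csv* file at each comma. The text in `csv_text` is an assumption for illustration, not one of the course datasets.

```python
# Raw text of a tiny csv file: the first line holds the column names
csv_text = "Title,IMDB Rating\nInception,9.1\nThe Dark Knight,8.9"

# Each line is one observation; splitting at the comma separates the variable values
rows = [line.split(',') for line in csv_text.splitlines()]
for row in rows:
    print(row)
```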

### Using the *read_csv* function

We will use functions from the *Pandas* library of *Python* to read data. Let us import *Pandas* to use its functions.

In [22]:
import pandas as pd

Note that *pd* is the alias that we will use to call a *Pandas* function. The alias can be any valid name chosen by the user, although *pd* is the common convention.

The function to read a *csv* file is [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). It reads the dataset into an object of type [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html). Let us read the dataset *movie_ratings.csv* in Python.

In [25]:
movie_ratings = pd.read_csv('movie_ratings.csv')

The built-in Python function `type` can be used to check the datatype of an object:

In [26]:
type(movie_ratings)

pandas.core.frame.DataFrame

Note that the file *movie_ratings.csv* is stored at the same location as the python script containing the above code. If that is not the case, we'll need to specify the location of the file as in the following code.

In [None]:
#| eval: false
movie_ratings = pd.read_csv('D:/Books/DataScience_Intro_python/movie_ratings.csv')

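Note that a forward slash is used instead of a backslash while specifying the path of the data file. This is because the backslash is the escape character in Python strings (for example, '\n' denotes a new line). Another option is to use two consecutive backslashes in place of each forward slash.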
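
The sketch below compares different ways of writing the same path as a Python string. The raw-string form (prefix `r`) is a third option not mentioned above; it tells Python to treat backslashes literally.

```python
# Three spellings of the same Windows path
path_forward = 'D:/Books/DataScience_Intro_python/movie_ratings.csv'
path_double = 'D:\\Books\\DataScience_Intro_python\\movie_ratings.csv'
path_raw = r'D:\Books\DataScience_Intro_python\movie_ratings.csv'

# The doubled backslashes and the raw string produce identical strings
print(path_double == path_raw)

# Replacing each forward slash by a backslash gives the same string as well
print(path_forward.replace('/', '\\') == path_double)
```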

### Specifying the working directory

In case we need to read several datasets from a given location, it may be inconvenient to specify the path every time. In such a case we can change the current working directory to the location where the datasets are located.

We'll use the *os* library of *Python* to view and/or change the current working directory.

In [11]:
import os #Importing the 'os' library
os.getcwd() #Getting the path to the current working directory

'C:\\Users\\akl0407\\Desktop\\STAT303-1\\Quarto Book\\DataScience_Intro_python'

The function *getcwd()* stands for get current working directory.

Suppose the dataset to be read is located at 'D:\Books\DataScience_Intro_python\Datasets'. Then, we'll use the function *chdir* to change the current working directory to this location.

In [12]:
#| eval: false
os.chdir('D:/Books/DataScience_Intro_python/Datasets')

Now we can read the dataset from this location without mentioning the entire path as shown below.

In [13]:
movie_ratings = pd.read_csv('movie_ratings.csv')
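
If changing the working directory is not desirable, one alternative is to store the folder once and build each file path from it. This is a minimal sketch; the variable names `data_dir` and `file_path` are chosen for illustration, and the folder is the one assumed in the example above.

```python
import os

# Folder that holds the datasets (same location as in the example above)
data_dir = 'D:/Books/DataScience_Intro_python/Datasets'

# os.path.join adds the separator between the folder and the file name
file_path = os.path.join(data_dir, 'movie_ratings.csv')
print(file_path)

# The path could then be passed to read_csv, e.g. pd.read_csv(file_path),
# provided the file exists at this location
```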

### Data overview and summary statistics

Once the data has been read, we may want to see what the data looks like. We'll use another *Pandas* function [head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html) to view the first few rows of the data.

In [14]:
movie_ratings.head()

Unnamed: 0,Title,US Gross,Worldwide Gross,Production Budget,Release Date,MPAA Rating,Source,Major Genre,Creative Type,IMDB Rating,IMDB Votes
0,Opal Dreams,14443,14443,9000000,Nov 22 2006,PG/PG-13,Adapted screenplay,Drama,Fiction,6.5,468
1,Major Dundee,14873,14873,3800000,Apr 07 1965,PG/PG-13,Adapted screenplay,Western/Musical,Fiction,6.7,2588
2,The Informers,315000,315000,18000000,Apr 24 2009,R,Adapted screenplay,Horror/Thriller,Fiction,5.2,7595
3,Buffalo Soldiers,353743,353743,15000000,Jul 25 2003,R,Adapted screenplay,Comedy,Fiction,6.9,13510
4,The Last Sin Eater,388390,388390,2200000,Feb 09 2007,PG/PG-13,Adapted screenplay,Drama,Fiction,5.7,1012


**Row Indices and column names (axis labels)**: \
The bold integers on the left are the indices of the DataFrame. Each index refers to a distinct row. For example, the index *2* corresponds to the row of the movie *The Informers*. By default, the indices are integers starting from 0. However, they can be changed (even to non-integer values) if desired by the user.

The bold text on top of the DataFrame refers to column names. For example, the column *US Gross* consists of the gross revenue of a movie in the US.

Collectively, the indices and column names are referred to as **axis labels**.
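
The axis labels can be inspected directly through the `index` and `columns` attributes of a DataFrame. The sketch below uses a small made-up DataFrame rather than *movie_ratings*, and also shows one way (the `set_index` method) of replacing the default integer indices with non-integer values.

```python
import pandas as pd

# A small DataFrame with the default integer row indices
df = pd.DataFrame({'Title': ['Inception', 'The Dark Knight'],
                   'IMDB Rating': [9.1, 8.9]})

print(df.index)    # row indices
print(df.columns)  # column names

# Row indices need not be integers: here the titles become the row indices
df_by_title = df.set_index('Title')
print(df_by_title.index)
```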

**Shape of DataFrame**: \
For finding the number of rows and columns in the data, you may use the [shape](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html) attribute. Note that *shape* is an attribute and not a function, so it is used without parentheses.

In [179]:
#Finding the shape of movie_ratings dataset
movie_ratings.shape

(2228, 11)

The *movie_ratings* dataset contains 2,228 observations (or rows) and 11 variables (or columns).
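
Since `shape` holds a tuple of the form (rows, columns), its two values can be unpacked into separate variables. A minimal sketch on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Title': ['Inception', 'The Dark Knight', 'Pulp Fiction'],
                   'IMDB Rating': [9.1, 8.9, 8.9]})

# Unpacking the tuple (rows, columns)
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() of a DataFrame gives the number of rows
print(len(df))
```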

For obtaining summary statistics of data, you may use the [describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html) function.

In [180]:
#Finding summary statistics of movie_ratings dataset
movie_ratings.describe()

Unnamed: 0,US Gross,Worldwide Gross,Production Budget,IMDB Rating,IMDB Votes
count,2228.0,2228.0,2228.0,2228.0,2228.0
mean,50763700.0,101937000.0,38160550.0,6.239004,33585.154847
std,66430810.0,164858900.0,37826040.0,1.243285,47325.651561
min,0.0,884.0,218.0,1.4,18.0
25%,9646188.0,13207370.0,12000000.0,5.5,6659.25
50%,28386490.0,42668920.0,26000000.0,6.4,18169.0
75%,64531400.0,120000000.0,53000000.0,7.1,40092.75
max,760167600.0,2767891000.0,300000000.0,9.2,519541.0


Answer the following questions based on the above table.

In [6]:
#| echo: false

from jupyterquiz import display_quiz
import json

In [11]:
#| echo: false

with open("./Datasets/questions_mc.json", "r") as file:
    questions=json.load(file)
display_quiz(questions)




In [9]:
#| echo: false

with open("./Datasets/questions_numeric.json", "r") as file:
    questions=json.load(file)
display_quiz(questions)




## Reading other data formats - txt, html, json

Although *csv* is a very popular format for structured data, data is found in several other formats as well. Some of the other data formats are *txt, html* and *json*.

### Reading *txt* files

The *txt* format offers some additional flexibility as compared to the *csv* format. In the *csv* format, the delimiter is a comma (or the column values are separated by a comma). However, in a *txt* file, the delimiter can be anything as desired by the user. Let us read the file *movie_ratings.txt*, where the variable values are separated by a tab character.

In [15]:
#| eval: false
movie_ratings_txt = pd.read_csv('movie_ratings.txt',sep='\t')

We use the function *read_csv* to read a *txt* file. However, we specify the tab character ('\t') as the separator of variable values.

Note that there is no need to remember the argument name - *sep* for specifying the delimiter. You can always refer to the [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) documentation to find the relevant argument.
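
The sketch below reads the same made-up data with two different delimiters. `io.StringIO` is used so that the text can be read as if it were a file; the contents are assumptions for illustration.

```python
import io
import pandas as pd

# The same data written with a tab and with a semi-colon as the delimiter
tab_text = "Title\tIMDB Rating\nInception\t9.1\n"
semicolon_text = "Title;IMDB Rating\nInception;9.1\n"

df_tab = pd.read_csv(io.StringIO(tab_text), sep='\t')
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=';')

# Both calls produce the same DataFrame once the right delimiter is specified
print(df_tab.equals(df_semicolon))
```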

### Reading HTML data

The *Pandas* function *read_html* searches for tabular data, i.e., data contained within the *\<table\>* tags of an html file. Let us read the tables in the GDP per capita [page](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita) on Wikipedia.

In [7]:
#Reading all the tables from the Wikipedia page on GDP per capita
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita')

All the tables will be read and stored in the variable named *tables*. Let us find the datatype of the variable *tables*.

In [16]:
#Finding datatype of the variable - tables
type(tables)

list

The variable *tables* is a list of all the tables read from the HTML data.

In [9]:
#Number of tables read from the page
len(tables)

6

The built-in function *len* can be used to find the length of the list *tables*, i.e., the number of tables read from the Wikipedia page. Let us check out the first table.

In [10]:
#Checking out the first table. Note that the index of the first table will be 0.
tables[0]

Unnamed: 0,0,1,2
0,.mw-parser-output .legend{page-break-inside:av...,"$20,000 - $30,000 $10,000 - $20,000 $5,000 - $...","$1,000 - $2,500 $500 - $1,000 <$500 No data"


The above table doesn't seem to be useful. Let us check out the second table.

In [11]:
#Checking out the second table. Note that the index of the second table will be 1.
tables[1]

Unnamed: 0_level_0,Country/Territory,UN Region,IMF[4][5],IMF[4][5],United Nations[6],United Nations[6],World Bank[7],World Bank[7]
Unnamed: 0_level_1,Country/Territory,UN Region,Estimate,Year,Estimate,Year,Estimate,Year
0,Liechtenstein *,Europe,—,—,180227,2020,169049,2019
1,Monaco *,Europe,—,—,173696,2020,173688,2020
2,Luxembourg *,Europe,135046,2022,117182,2020,135683,2021
3,Bermuda *,Americas,—,—,123945,2020,110870,2021
4,Ireland *,Europe,101509,2022,86251,2020,85268,2020
...,...,...,...,...,...,...,...,...
212,Central AfricanRepublic *,Africa,527,2022,481,2020,477,2020
213,Sierra Leone *,Africa,513,2022,475,2020,485,2020
214,Madagascar *,Africa,504,2022,470,2020,496,2020
215,South Sudan *,Africa,393,2022,1421,2020,1120,2015


The above table contains the estimated GDP per capita of all countries. This is the table that is likely to be relevant to a user interested in analyzing GDP per capita of countries. Instead of reading all tables of an HTML file, we can restrict the search to tables containing certain relevant keywords. Let us try searching for all tables containing the word 'Country'.

In [13]:
#Reading all the tables from the Wikipedia page on GDP per capita, containing the word 'Country'
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita', match = 'Country')

The *match* argument can be used to specify the keywords to be present in the table to be read.

In [15]:
len(tables)

1

Only one table contains the keyword - 'Country'. Let us check out the table obtained.

In [18]:
#Table having the keyword - 'Country' from the HTML page
tables[0]

Unnamed: 0_level_0,Country/Territory,UN Region,IMF[4][5],IMF[4][5],United Nations[6],United Nations[6],World Bank[7],World Bank[7]
Unnamed: 0_level_1,Country/Territory,UN Region,Estimate,Year,Estimate,Year,Estimate,Year
0,Liechtenstein *,Europe,—,—,180227,2020,169049,2019
1,Monaco *,Europe,—,—,173696,2020,173688,2020
2,Luxembourg *,Europe,135046,2022,117182,2020,135683,2021
3,Bermuda *,Americas,—,—,123945,2020,110870,2021
4,Ireland *,Europe,101509,2022,86251,2020,85268,2020
...,...,...,...,...,...,...,...,...
212,Central AfricanRepublic *,Africa,527,2022,481,2020,477,2020
213,Sierra Leone *,Africa,513,2022,475,2020,485,2020
214,Madagascar *,Africa,504,2022,470,2020,496,2020
215,South Sudan *,Africa,393,2022,1421,2020,1120,2015


The argument *match* helps with a more focussed search, and helps us discard irrelevant tables.
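
The table read above has two header rows, so its column names consist of two levels. If single-level column names are preferred, one possible approach is to join the two levels. The sketch below is only an illustration on a tiny stand-in DataFrame built by hand (it does not read the Wikipedia page), with shortened, hypothetical column names.

```python
import pandas as pd

# A tiny stand-in with two header levels, loosely mimicking the table above
columns = pd.MultiIndex.from_tuples([('Country/Territory', 'Country/Territory'),
                                     ('IMF', 'Estimate'),
                                     ('IMF', 'Year')])
gdp = pd.DataFrame([['Luxembourg', 135046, 2022],
                    ['Ireland', 101509, 2022]], columns=columns)

# Join the two levels into one name, keeping a single copy when both levels match
gdp.columns = [top if top == bottom else f'{top} {bottom}'
               for top, bottom in gdp.columns]
print(gdp.columns.tolist())
```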

### Reading JSON data

JSON stands for JavaScript Object Notation, in which the data is stored and transmitted as plain text. Since the format is text only, JSON data can easily be exchanged between web applications, and used by any programming language. Unlike the *csv* format, JSON supports a hierarchical data structure, and is easier to integrate with APIs.
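
To illustrate the hierarchical structure, the sketch below parses a small made-up JSON text in which each record contains a nested object. The records and field names are hypothetical. The *Pandas* function `json_normalize` is one way to flatten such nested records into columns.

```python
import json
import pandas as pd

# Made-up JSON text: each talk holds a nested object under "speaker"
json_text = '''
[
  {"id": 1, "speaker": {"name": "A", "country": "X"}, "views": 100},
  {"id": 2, "speaker": {"name": "B", "country": "Y"}, "views": 250}
]
'''

# json.loads converts the text to native Python objects (a list of dictionaries)
records = json.loads(json_text)
print(records[0]['speaker']['name'])

# Nested keys are flattened to column names such as 'speaker.name'
flat = pd.json_normalize(records)
print(flat.columns.tolist())
```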

Let's read JSON data on TED Talks. The *Pandas* function [read_json](https://pandas.pydata.org/docs/reference/api/pandas.read_json.html) converts JSON data to a DataFrame.

In [21]:
tedtalks_data = pd.read_json('https://raw.githubusercontent.com/cwkenwaysun/TEDmap/master/data/TED_Talks.json')

In [27]:
tedtalks_data.head()

Unnamed: 0,id,speaker,headline,URL,description,transcript_URL,month_filmed,year_filmed,event,duration,date_published,tags,newURL,date,views,rates
0,7,David Pogue,Simplicity sells,http://www.ted.com/talks/view/id/7,New York Times columnist David Pogue takes aim...,http://www.ted.com/talks/view/id/7/transcript?...,2,2006,TED2006,0:21:26,6/27/06,"simplicity,computers,software,interface design...",https://www.ted.com/talks/david_pogue_says_sim...,2006-06-27,1646773,"[{'id': 7, 'name': 'Funny', 'count': 968}, {'i..."
1,6,Craig Venter,Sampling the ocean's DNA,http://www.ted.com/talks/view/id/6,Genomics pioneer Craig Venter takes a break fr...,http://www.ted.com/talks/view/id/6/transcript?...,7,2005,TEDGlobal 2005,0:16:51,2004/05/07,"biotech,invention,oceans,genetics,DNA,biology,...",https://www.ted.com/talks/craig_venter_on_dna_...,2004-05-07,562625,"[{'id': 3, 'name': 'Courageous', 'count': 21},..."
2,4,Burt Rutan,The real future of space exploration,http://www.ted.com/talks/view/id/4,"In this passionate talk, legendary spacecraft ...",http://www.ted.com/talks/view/id/4/transcript?...,2,2006,TED2006,0:19:37,10/25/06,"aircraft,flight,industrial design,NASA,rocket ...",https://www.ted.com/talks/burt_rutan_sees_the_...,2006-10-25,2046869,"[{'id': 3, 'name': 'Courageous', 'count': 169}..."
3,3,Ashraf Ghani,How to rebuild a broken state,http://www.ted.com/talks/view/id/3,Ashraf Ghani's passionate and powerful 10-minu...,http://www.ted.com/talks/view/id/3/transcript?...,7,2005,TEDGlobal 2005,0:18:45,10/18/06,"corruption,poverty,economics,investment,milita...",https://www.ted.com/talks/ashraf_ghani_on_rebu...,2006-10-18,814554,"[{'id': 3, 'name': 'Courageous', 'count': 140}..."
4,5,Chris Bangle,Great cars are great art,http://www.ted.com/talks/view/id/5,American designer Chris Bangle explains his ph...,http://www.ted.com/talks/view/id/5/transcript?...,2,2002,TED2002,0:20:04,2004/05/07,"cars,industrial design,transportation,inventio...",https://www.ted.com/talks/chris_bangle_says_gr...,2004-05-07,870950,"[{'id': 1, 'name': 'Beautiful', 'count': 89}, ..."


In [41]:
#| echo: false

with open("question_json_data.json", "r") as file:
    questions=json.load(file)
display_quiz(questions)




### Reading data from web APIs

[API](https://en.wikipedia.org/wiki/API), an acronym for Application Programming Interface, is a way of transferring information between systems. Many websites have public APIs that provide data via JSON or other formats. For example, the [IMDb-API](https://imdb-api.com/) is a web service for receiving information on movies, series, and cast. API results are in the JSON format and include items such as movie specifications, ratings, Wikipedia page content, etc. One of these APIs contains ratings of the top 250 movies on IMDB. Let us read this data using the IMDB API.

We'll use the *get* function from the python library *requests* to request data from the API and obtain a response code. The response code will let us know if our request to pull data from this API was successful.

In [124]:
#Importing the requests library
import requests as rq

In [125]:
# Downloading imdb top 250 movie's data
url = 'https://imdb-api.com/en/API/Top250Movies/k_v6gf8ppf' #URL of the API containing top 250 movies based on IMDB ratings
response = rq.get(url) #Requesting data from the API
response

<Response [200]>

We have a response code of 200, which indicates that the request was successful.

The response object's JSON method will return a dictionary containing JSON parsed into native Python objects.

In [122]:
movie_data = response.json()

In [123]:
movie_data.keys()

dict_keys(['items', 'errorMessage'])

The dictionary *movie_data* contains only two keys. The *items* key seems likely to contain information about the top 250 movies. Let us convert the value of the *items* key (which is a list of dictionaries) to a DataFrame, so that we can view it in a tabular form.
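
To see why a list of dictionaries maps naturally onto a table, consider the sketch below. The `payload` dictionary is a hypothetical, hand-written stand-in shaped like the parsed response (it is not actual API output): each dictionary in the list becomes a row, and the keys become the column names.

```python
import pandas as pd

# Hypothetical payload with the same two keys as the parsed response
payload = {
    'items': [
        {'rank': '1', 'title': 'The Shawshank Redemption', 'year': '1994'},
        {'rank': '2', 'title': 'The Godfather', 'year': '1972'},
    ],
    'errorMessage': '',
}

# Fall back to an empty list if the 'items' key is missing or empty
items = payload.get('items') or []

# Each dictionary becomes a row; the dictionary keys become the columns
items_df = pd.DataFrame(items)
print(items_df)
```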

In [126]:
#Converting a list of dictionaries to a dataframe
movie_data_df = pd.DataFrame(movie_data['items'])

In [113]:
#Checking the movie data pulled using the API
movie_data_df.head()

Unnamed: 0,id,rank,title,fullTitle,year,image,crew,imDbRating,imDbRatingCount
0,tt0111161,1,The Shawshank Redemption,The Shawshank Redemption (1994),1994,https://m.media-amazon.com/images/M/MV5BMDFkYT...,"Frank Darabont (dir.), Tim Robbins, Morgan Fre...",9.2,2624065
1,tt0068646,2,The Godfather,The Godfather (1972),1972,https://m.media-amazon.com/images/M/MV5BM2MyNj...,"Francis Ford Coppola (dir.), Marlon Brando, Al...",9.2,1817542
2,tt0468569,3,The Dark Knight,The Dark Knight (2008),2008,https://m.media-amazon.com/images/M/MV5BMTMxNT...,"Christopher Nolan (dir.), Christian Bale, Heat...",9.0,2595637
3,tt0071562,4,The Godfather Part II,The Godfather Part II (1974),1974,https://m.media-amazon.com/images/M/MV5BMWMwMG...,"Francis Ford Coppola (dir.), Al Pacino, Robert...",9.0,1248050
4,tt0050083,5,12 Angry Men,12 Angry Men (1957),1957,https://m.media-amazon.com/images/M/MV5BMWU4N2...,"Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb",8.9,775140


In [114]:
#Rows and columns of the movie data
movie_data_df.shape

(250, 9)

This API provides the names of the top 250 movies along with the year of release, IMDB ratings, and cast information.

## Writing data

The *Pandas* function *to_csv* can be used to write (or export) data to a *csv* or *txt* file. Below are some examples.

**Example 1:** Let us export the movies data of the top 250 movies to a *csv* file.

In [130]:
#Exporting the data of the top 250 movies to a csv file
movie_data_df.to_csv('movie_data_exported.csv')

The file *movie_data_exported.csv* will appear in the working directory.

**Example 2:** Let us export the movies data of the top 250 movies to a *txt* file with a semi-colon as the delimiter.

In [131]:
movie_data_df.to_csv('movie_data_exported.txt',sep=';')
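
By default, *to_csv* also writes the row indices to the file, which show up as an extra column named *Unnamed: 0* when the file is read back. The sketch below, on a small made-up DataFrame written to in-memory buffers instead of files, shows how the argument `index=False` avoids this.

```python
import io
import pandas as pd

df = pd.DataFrame({'title': ['The Godfather', '12 Angry Men'],
                   'year': [1972, 1957]})

# Default export: the row indices are written as an extra, unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False leaves the row indices out of the exported data
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```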

**Example 3:** Let us export the movies data of the top 250 movies to a *JSON* file.

In [136]:
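import json #Importing the 'json' library (needed if it has not been imported already)
#Writing the dictionary obtained from the API to a JSON file
with open('movie_data.json', 'w') as f:
    json.dump(movie_data, f)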
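
A DataFrame can also be converted to JSON directly with the *Pandas* method `to_json`. The sketch below uses a small made-up DataFrame and keeps the JSON text in memory instead of writing a file; `orient='records'` stores the data as a list with one dictionary per row.

```python
import io
import pandas as pd

df = pd.DataFrame({'title': ['The Godfather', '12 Angry Men'],
                   'year': [1972, 1957]})

# Without a file name, to_json returns the JSON text as a string
json_text = df.to_json(orient='records')
print(json_text)

# Reading the text back gives a DataFrame with the same values
restored = pd.read_json(io.StringIO(json_text), orient='records')
print(restored)
```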

## Sub-setting data: `loc` and `iloc`

Sometimes we may be interested in working with a subset of rows and columns of the data, instead of working with the entire dataset. The indexing operators [loc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) and [iloc](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) provide a convenient way of selecting a subset of desired rows and columns. The operator `loc` uses axis labels (row indices and column names) to subset the data, while `iloc` uses the integer position of the rows and columns (this is different from the row index). Note that the position of both rows and columns starts from 0.

Let us read the file *movie_IMDBratings_sorted.csv*, which has movies sorted in the descending order of their IMDB ratings. 

In [43]:
movies_sorted = pd.read_csv('./Datasets/movie_IMDBratings_sorted.csv',index_col = 0)

The argument `index_col=0` assigns the first column of the file as the row indices of the DataFrame.

In [44]:
movies_sorted.head()

Rank,Title,US Gross,Worldwide Gross,Production Budget,Release Date,MPAA Rating,Source,Major Genre,Creative Type,IMDB Rating,IMDB Votes
1,The Shawshank Redemption,28241469,28241469,25000000,Sep 23 1994,R,Adapted screenplay,Drama,Fiction,9.2,519541
2,Inception,285630280,753830280,160000000,Jul 16 2010,PG/PG-13,Original Screenplay,Horror/Thriller,Fiction,9.1,188247
3,The Dark Knight,533345358,1022345358,185000000,Jul 18 2008,PG/PG-13,Adapted screenplay,Action/Adventure,Fiction,8.9,465000
4,Schindler's List,96067179,321200000,25000000,Dec 15 1993,R,Adapted screenplay,Drama,Non-Fiction,8.9,276283
5,Pulp Fiction,107928762,212928762,8000000,Oct 14 1994,R,Original Screenplay,Drama,Fiction,8.9,417703


Let us say we wish to subset the title, worldwide gross, production budget, and IMDB rating of the top 3 movies.

In [48]:
# Subsetting the DataFrame by loc - using axis labels
movies_subset = movies_sorted.loc[1:3,['Title','Worldwide Gross','Production Budget','IMDB Rating']]
movies_subset

Rank,Title,Worldwide Gross,Production Budget,IMDB Rating
1,The Shawshank Redemption,28241469,25000000,9.2
2,Inception,753830280,160000000,9.1
3,The Dark Knight,1022345358,185000000,8.9


In [49]:
# Subsetting the DataFrame by iloc - using index of the position of rows and columns
movies_subset = movies_sorted.iloc[0:3,[0,2,3,9]]
movies_subset

Rank,Title,Worldwide Gross,Production Budget,IMDB Rating
1,The Shawshank Redemption,28241469,25000000,9.2
2,Inception,753830280,160000000,9.1
3,The Dark Knight,1022345358,185000000,8.9
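
Note the difference in how the two operators treat the end of a slice: with `loc`, the end label is included, while with `iloc`, the end position is excluded. The following sketch on a small made-up DataFrame (with ranks starting from 1 as the row indices) shows that `loc[1:3]` and `iloc[0:3]` select the same three rows.

```python
import pandas as pd

# Made-up data with ranks starting from 1 as the row indices
df = pd.DataFrame({'Title': ['A', 'B', 'C', 'D'],
                   'IMDB Rating': [9.2, 9.1, 8.9, 8.9]},
                  index=pd.Index([1, 2, 3, 4], name='Rank'))

# Labels 1, 2 and 3: the end label is included
by_label = df.loc[1:3, ['Title']]

# Positions 0, 1 and 2: the end position is excluded
by_position = df.iloc[0:3, [0]]

print(by_label.equals(by_position))
```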
