# Data Acquisition

A dataset is typically stored as a file in one of several formats, such as **.csv, .json, or .xlsx.** The file can live in various places, such as your computer, a server, a website, or cloud storage. To analyze the data in a Python notebook, we first need to bring it into the notebook.

This notebook works with a CSV file. The Pandas library is a popular tool for reading datasets in many formats into a data frame. Our Jupyter notebook platforms have Pandas built in, so all we need to do is import it, with no installation step.

# Contents

1. About pip install requests
2. Bigmac Dataset

#### Extracting the Dataset from Kaggle

3. SyntaxError: Unicode Error
4. Features of the DataFrame
5. Add Headers
6. Checking for NaN Values in the DataFrame
7. dropna() method
8. Save/Read the dataset

#### Exploring the Dataset

9. Data Types
10. Get Statistical Summary
11. info() method

## 1. pip install requests

If you are running the code in a Jupyter Notebook and have not yet installed the requests library, install it with <code>pip install requests</code>, either in your terminal or command prompt or in a notebook cell.

After executing this command, you should be able to import and use the requests library in your notebook. Then, you can proceed with the rest of the code to download the dataset.

A download script built on requests fetches the dataset from a given URL and saves it under a chosen filename in the current directory. You can then use <code>pandas.read_csv()</code> to read the CSV file into a DataFrame as usual.
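The download step described above could look roughly like the following sketch. The function name <code>download_file</code> and the example URL are placeholders made up for illustration, not part of the original notebook. Kaggle downloads generally require you to be signed in, so a plain URL request may not work for the Bigmac dataset itself.

```python
import requests


def download_file(url, filename, timeout=30):
    """Download the content at `url` and save it to `filename`."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename


# Hypothetical usage (placeholder URL, not a real dataset link):
# download_file("https://example.com/BigmacPrice.csv", "BigmacPrice.csv")
# bigmac_df = pd.read_csv("BigmacPrice.csv")
```

The call is left commented out so that the cell does not attempt a network request against a placeholder address.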

In [1]:
pip install requests

Note: you may need to restart the kernel to use updated packages.


## 2. Bigmac Dataset

This is the link for the Bigmac Dataset.

**McDonald's Big Mac price for every country in the world from 2000 to 2022.**

https://www.kaggle.com/datasets/vittoriogiatti/bigmacprice/data

# Extracting the Dataset from Kaggle

1. Download the dataset from Kaggle and save it as a CSV file on your computer.

2. Then copy the file path of the downloaded CSV file and convert it as described in the following section.


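## 3. What to do when encountering SyntaxError: (unicode error)

- **copied path from Excel file**

file_path = "C:\Users\-----\Documents\Aira\All Python Files\Kaggle Data Sets\BigmacPrice.csv"

Python raises the error because, in a normal string, it reads the <code>\U</code> in <code>C:\Users</code> as the start of a Unicode escape sequence.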

Convert this into either:

- **raw string literal**
file_path = r'C:\Users\-----\Documents\Aira\All Python Files\Kaggle Data Sets\BigmacPrice.csv'

or

- **escaping backslashes**
file_path = 'C:\\Users\\-----\\Documents\\Aira\\All Python Files\\Kaggle Data Sets\\BigmacPrice.csv'
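As a quick check, the sketch below shows that the two forms produce the same string, and that forward slashes are a third way to write the same Windows path. The path is a made-up example, not the one from this notebook.

```python
from pathlib import PureWindowsPath

# Hypothetical path used only for illustration
raw_path = r'C:\Users\example\Documents\BigmacPrice.csv'
escaped_path = 'C:\\Users\\example\\Documents\\BigmacPrice.csv'
forward_path = 'C:/Users/example/Documents/BigmacPrice.csv'

# The raw string and the escaped string hold exactly the same text
print(raw_path == escaped_path)  # True

# Forward slashes name the same Windows path
print(PureWindowsPath(forward_path) == PureWindowsPath(raw_path))  # True
```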


In [4]:
# Import necessary libraries
import pandas as pd

# Define the file path to your CSV file
# Replace this path with the location of the file on your computer
file_path = r'C:\Users\jocke\Documents\Aira\All Python Files\Kaggle Data Sets\BigmacPrice.csv'

# Read the CSV file into a DataFrame called "bigmac_df"
bigmac_df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame to check if the data is imported correctly
bigmac_df.head()


Unnamed: 0,date,currency_code,name,local_price,dollar_ex,dollar_price
0,2000-04-01,ARS,Argentina,2.5,1,2.5
1,2000-04-01,AUD,Australia,2.59,1,2.59
2,2000-04-01,BRL,Brazil,2.95,1,2.95
3,2000-04-01,GBP,Britain,1.9,1,1.9
4,2000-04-01,CAD,Canada,2.85,1,2.85


In [6]:
# display the whole dataframe
bigmac_df

Unnamed: 0,date,currency_code,name,local_price,dollar_ex,dollar_price
0,2000-04-01,ARS,Argentina,2.50,1,2.50
1,2000-04-01,AUD,Australia,2.59,1,2.59
2,2000-04-01,BRL,Brazil,2.95,1,2.95
3,2000-04-01,GBP,Britain,1.90,1,1.90
4,2000-04-01,CAD,Canada,2.85,1,2.85
...,...,...,...,...,...,...
1941,2022-07-01,AED,United Arab Emirates,18.00,3,6.00
1942,2022-07-01,USD,United States,5.15,1,5.15
1943,2022-07-01,UYU,Uruguay,255.00,41,6.22
1944,2022-07-01,VES,Venezuela,10.00,5,2.00


Here, you can see that the dataframe has 1946 rows and 6 columns.
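You can also read the row and column counts programmatically instead of from the display. The sketch below uses a small stand-in DataFrame, with values copied from the first rows shown above, so that it runs without the CSV file.

```python
import pandas as pd

# Small stand-in DataFrame (first three rows of the output above)
sample_df = pd.DataFrame({
    "date": ["2000-04-01", "2000-04-01", "2000-04-01"],
    "currency_code": ["ARS", "AUD", "BRL"],
    "name": ["Argentina", "Australia", "Brazil"],
    "local_price": [2.5, 2.59, 2.95],
    "dollar_ex": [1, 1, 1],
    "dollar_price": [2.5, 2.59, 2.95],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = sample_df.shape
print(n_rows, n_cols)  # 3 6
```

Applied to the full dataset, <code>bigmac_df.shape</code> should report the same counts as the display.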

## 4. Features of the data frame

![image.png](attachment:image.png)

## 5. Add Headers

Take a look at the data set. Let's say you want to **rename** the "name" column to "Country". Also, you want to **reset** the headers. **When a file has no header row and is read with <code>header=None</code>, Pandas automatically sets the headers to integers starting from 0.** This file already has a header row, so the columns carry the names from the file.

First, create a list "headers" that includes all column names in order. Then, use <code>df.columns = headers</code> to replace the headers with the list you created.

In [8]:
# Create headers list
headers = ["Date", "Currency_Code", "Country", "Local_Price", "Dollar_Ex", "Dollar_Price"]

# Replace headers and recheck data frame
bigmac_df.columns = headers
bigmac_df.columns

Index(['Date', 'Currency_Code', 'Country', 'Local_Price', 'Dollar_Ex',
       'Dollar_Price'],
      dtype='object')
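The sketch below illustrates two related points on toy data. It shows the integer headers that Pandas assigns when a file is read with <code>header=None</code>, and it shows <code>rename()</code> as an alternative when only one column needs a new name. The inline CSV text and the variable names are made up for this example.

```python
import io

import pandas as pd

# Two data rows with no header row (toy data for this demo)
csv_text = (
    "2000-04-01,ARS,Argentina,2.5,1,2.5\n"
    "2000-04-01,AUD,Australia,2.59,1,2.59\n"
)

# header=None tells Pandas there is no header row, so columns are labeled 0, 1, 2, ...
no_header_df = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header_df.columns))  # [0, 1, 2, 3, 4, 5]

# Assign all headers at once from a list
no_header_df.columns = ["Date", "Currency_Code", "Country",
                        "Local_Price", "Dollar_Ex", "Dollar_Price"]

# Alternative: rename a single column and leave the others untouched
named_df = pd.DataFrame({"name": ["Argentina"], "local_price": [2.5]})
renamed_df = named_df.rename(columns={"name": "Country"})
print(list(renamed_df.columns))  # ['Country', 'local_price']
```

Assigning to <code>columns</code> requires a name for every column in order, while <code>rename()</code> only touches the columns listed in the mapping and returns a new DataFrame.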

In [9]:
bigmac_df

Unnamed: 0,Date,Currency_Code,Country,Local_Price,Dollar_Ex,Dollar_Price
0,2000-04-01,ARS,Argentina,2.50,1,2.50
1,2000-04-01,AUD,Australia,2.59,1,2.59
2,2000-04-01,BRL,Brazil,2.95,1,2.95
3,2000-04-01,GBP,Britain,1.90,1,1.90
4,2000-04-01,CAD,Canada,2.85,1,2.85
...,...,...,...,...,...,...
1941,2022-07-01,AED,United Arab Emirates,18.00,3,6.00
1942,2022-07-01,USD,United States,5.15,1,5.15
1943,2022-07-01,UYU,Uruguay,255.00,41,6.22
1944,2022-07-01,VES,Venezuela,10.00,5,2.00


## 6. Checking for NaN Values in the DataFrame

There could be NaN (Not a Number) values in the DataFrame, which Pandas uses to mark missing data. Since this is a large dataset with 1946 rows, checking them one by one would be time-consuming.

You can check for NaN values per column or for the entire DataFrame at once. Here are some methods:

<code>x = df.isnull().sum()</code>
This will give the count of NaN values in **each column** of the DataFrame.

In [10]:
# Check for NaN values per column
nan_per_column = bigmac_df.isnull().sum()
nan_per_column

Date             0
Currency_Code    0
Country          0
Local_Price      0
Dollar_Ex        0
Dollar_Price     0
dtype: int64

<code>x = df.isnull().sum().sum()</code>
This will give the total count of NaN values in the **entire DataFrame.**

In [11]:
# Check for NaN values in the entire DataFrame
nan_total = bigmac_df.isnull().sum().sum()
print("Total NaN values in DataFrame:", nan_total)


Total NaN values in DataFrame: 0


This implies that the dataset is complete, and there are no missing values in any of the columns or the dataset.
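Because the Bigmac dataset has no missing values, the counts above are all zero. To see what the same calls report when values are missing, here is a minimal sketch on a made-up DataFrame (the values are invented for illustration).

```python
import numpy as np
import pandas as pd

# Toy DataFrame with three missing values
toy_df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Local_Price": [2.5, np.nan, 2.95],
    "Dollar_Price": [np.nan, np.nan, 2.95],
})

# NaN count per column
nan_per_column = toy_df.isnull().sum()
print(nan_per_column)

# NaN count for the whole DataFrame
nan_total = toy_df.isnull().sum().sum()
print(nan_total)  # 3

# Rows that contain at least one NaN
rows_with_nan = toy_df[toy_df.isnull().any(axis=1)]
print(rows_with_nan)
```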

## 7. dropna() method

However, if NaN (Not a Number) values are found, you need to handle them, because they represent **missing or undefined values** in the data and can affect the accuracy and reliability of your analysis. One common option is to drop them.

The <code>dropna()</code> method in pandas removes missing values (NaN, null values) from a DataFrame or Series object. It lets you choose which **axis (rows or columns)** to drop along, as well as a **threshold (<code>thresh</code>): the minimum number of non-missing values a row or column must have to be kept.**

**Axis** specifies whether to drop rows <code>(axis=0)</code> or columns <code>(axis=1)</code> that contain missing values. **By default, it's set to 0, meaning it drops rows.**

In [12]:
"""sample code if NaN Values are found in dataset
# drop missing values along the column "x" as follows
df = df1.dropna(subset = ["x"], axis = 0)
df.head()
"""

'sample code if NaN Values are found in dataset\n# drop missing values along the column "x" as follows\ndf = df1.dropna(subset = ["x"], axis = 0)\ndf.head()\n'
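Here is a runnable sketch of the main <code>dropna()</code> options on a made-up DataFrame. The column names "x", "y", and "z" and all values are invented for illustration.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": [np.nan, np.nan, 6.0, 8.0],
    "z": [7.0, np.nan, 9.0, 10.0],
})

# Default (axis=0): drop every row that contains at least one NaN
rows_any = toy_df.dropna()

# Only look at column "x" when deciding which rows to drop
rows_subset = toy_df.dropna(subset=["x"])

# axis=1: drop every column that contains at least one NaN
cols_any = toy_df.dropna(axis=1)

# thresh=2: keep only rows with at least two non-missing values
rows_thresh = toy_df.dropna(thresh=2)

print(rows_any.index.tolist())     # [2]
print(rows_subset.index.tolist())  # [0, 2]
print(cols_any.shape)              # (4, 0)
print(rows_thresh.index.tolist())  # [0, 2, 3]
```

Each call returns a new DataFrame and leaves <code>toy_df</code> unchanged, so assign the result to a variable if you want to keep it.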

## 8. Save Dataset

Pandas also enables you to save the data set to CSV. Call the <code>dataframe.to_csv()</code> method and pass the file path and name, in quotation marks, inside the parentheses.


If you want to save the DataFrame named bigmac_df as a file named **bigmac.csv** in the current working directory on your computer, you can use the following code.

The <code>index=False</code> part means that the row names or indices will not be saved along with the data.

In [14]:
bigmac_df.to_csv("bigmac.csv", index=False)
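To see the effect of <code>index=False</code>, the sketch below saves a small made-up DataFrame both ways into a temporary folder and reads each file back. The file names and values are placeholders for illustration.

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame({
    "Country": ["Argentina", "Australia"],
    "Dollar_Price": [2.5, 2.59],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    without_index_path = os.path.join(tmp_dir, "without_index.csv")

    sample_df.to_csv(with_index_path)                  # index is written as an extra column
    sample_df.to_csv(without_index_path, index=False)  # index is left out

    cols_with_index = list(pd.read_csv(with_index_path).columns)
    cols_without_index = list(pd.read_csv(without_index_path).columns)

print(cols_with_index)     # ['Unnamed: 0', 'Country', 'Dollar_Price']
print(cols_without_index)  # ['Country', 'Dollar_Price']
```

When the index is saved, reading the file back produces an extra column named "Unnamed: 0", which is usually not wanted.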

## Read/Save Other Data Formats

You can also read and save other file formats, using functions similar to <code>pd.read_csv()</code> and <code>df.to_csv()</code>. The functions are listed in the following table:

![image.png](attachment:image.png)
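As one example of another format, this sketch writes a small made-up DataFrame to JSON with <code>to_json()</code> and reads it back with <code>pd.read_json()</code>. The file name and values are placeholders for illustration.

```python
import os
import tempfile

import pandas as pd

sample_df = pd.DataFrame({
    "Country": ["Argentina", "Australia"],
    "Dollar_Price": [2.5, 2.59],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "sample.json")

    # orient="records" stores the data as a list of row objects
    sample_df.to_json(json_path, orient="records")
    restored_df = pd.read_json(json_path, orient="records")

print(restored_df)
```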

# Exploring the Dataset

After successfully extracting and reading the data into a Pandas DataFrame, it is important to explore the data set. There are several ways to obtain essential insights into the data to help you better understand it. These can be done with the following:

## 9. Check the Data Types

Data has a variety of types. The main types stored in Pandas data frames are **object, float, int, bool and datetime64.** 

In order to better learn about each attribute, you should always know the data type of each column.

In [15]:
# Returns a series with the data type of each column.
bigmac_df.dtypes

Date              object
Currency_Code     object
Country           object
Local_Price      float64
Dollar_Ex          int64
Dollar_Price     float64
dtype: object
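In the output above, the Date column is stored as <code>object</code>, meaning text. If you want to treat it as dates, one option is <code>pd.to_datetime()</code>. The sketch below shows this on a small made-up DataFrame. The conversion is a suggestion rather than a step from the original notebook.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Date": ["2000-04-01", "2022-07-01"],
    "Dollar_Price": [2.5, 5.15],
})
print(sample_df.dtypes)  # Date is read as text

# Convert the text column to a datetime column
sample_df["Date"] = pd.to_datetime(sample_df["Date"])
print(sample_df.dtypes)  # Date is now a datetime64 type

# Datetime columns expose date parts through the .dt accessor
years = sample_df["Date"].dt.year.tolist()
print(years)  # [2000, 2022]
```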

## 10. Get Summary Statistics
### bigmac_df.describe(include='all')

This code will display **summary statistics** for **all columns** in the DataFrame, including count, unique, top, and frequency for categorical columns, and mean, standard deviation, minimum, maximum, and quartiles for numerical columns.

In [18]:
# Display summary statistics for all columns of the DataFrame
bigmac_df.describe(include='all')

Unnamed: 0,Date,Currency_Code,Country,Local_Price,Dollar_Ex,Dollar_Price
count,1946,1946,1946,1946.0,1946.0,1946.0
unique,37,58,74,,,
top,2022-01-01,EUR,Argentina,,,
freq,73,351,37,,,
mean,,,,15816.09,4722.255,3.568011
std,,,,394005.0,100623.2,1.417054
min,,,,0.0,1.0,0.0
25%,,,,4.45,1.0,2.5725
50%,,,,15.0,5.0,3.4
75%,,,,87.0,32.0,4.24
max,,,,16020000.0,3613989.0,11.25
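To make the role of <code>include='all'</code> concrete, the sketch below compares it with the default call on a small made-up DataFrame. By default, <code>describe()</code> summarizes only the numeric columns.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Country": ["Argentina", "Australia", "Argentina"],
    "Dollar_Price": [2.5, 2.59, 3.0],
})

# Default: numeric columns only
numeric_summary = sample_df.describe()
print(list(numeric_summary.columns))  # ['Dollar_Price']

# include='all': text columns are summarized too (unique, top, freq)
full_summary = sample_df.describe(include="all")
print(full_summary)
```

In the full summary, statistics that do not apply to a column, such as the mean of a text column, appear as NaN.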


## describe() method: Select particular columns/series in the data frame

You can select columns of a dataframe by passing a list of column names. For example, you can select three columns as follows:

<code>dataframe[['column 1', 'column 2', 'column 3']]</code>

Where **column** is the name of the column. You can then apply the method <code>.describe()</code> to get the statistics of those columns as follows:

<code>dataframe[['column 1', 'column 2', 'column 3']].describe()</code>



In [19]:
bigmac_df[['Local_Price', 'Dollar_Ex', 'Dollar_Price']].describe()

Unnamed: 0,Local_Price,Dollar_Ex,Dollar_Price
count,1946.0,1946.0,1946.0
mean,15816.09,4722.255,3.568011
std,394005.0,100623.2,1.417054
min,0.0,1.0,0.0
25%,4.45,1.0,2.5725
50%,15.0,5.0,3.4
75%,87.0,32.0,4.24
max,16020000.0,3613989.0,11.25


## 11. Info
**It provides a concise summary of the dataframe.**

This method prints information about a data frame including the index dtype and columns, non-null values and memory usage.

In [20]:
bigmac_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1946 entries, 0 to 1945
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           1946 non-null   object 
 1   Currency_Code  1946 non-null   object 
 2   Country        1946 non-null   object 
 3   Local_Price    1946 non-null   float64
 4   Dollar_Ex      1946 non-null   int64  
 5   Dollar_Price   1946 non-null   float64
dtypes: float64(2), int64(1), object(3)
memory usage: 91.3+ KB
