In [82]:
# First things first, import:
import pandas as pd
import numpy as np
import hashlib

This notebook tests the following:

- Creating a pandas DataFrame
- Loading a dataset from a file
- Previewing a DataFrame
- Converting datatypes in a DataFrame

## Exercise 1: DataFrames

The goal of this exercise is to create a simple DataFrame from several data structures.
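
As a minimal sketch of the idea: a dictionary whose values are equal-length sequences (plain lists or NumPy arrays) can be passed to `pd.DataFrame`, and each key becomes a column name. The names `fruits`, `prices`, and `toy_df` below are illustrative and not part of the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one plain list and one NumPy array of equal length.
fruits = ['apple', 'banana', 'cherry']
prices = np.array([1.2, 0.5, 3.0])

# Each dictionary key becomes a column; index= labels the rows.
toy_df = pd.DataFrame({'fruits': fruits, 'prices': prices},
                      index=['first', 'second', 'third'])

print(toy_df.shape)           # (number of rows, number of columns)
print(toy_df.index.tolist())  # row labels as a plain list
```

Note that `index.tolist()` returns a list, so it should be compared against a list rather than a tuple.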

In [83]:
# RUN THIS CELL FIRST
# this is the data you'll use to fill each column of your dataframe
emojis = ['Face with Tears of Joy', 'Red Heart', 'Rolling on the Floor Laughing', 'Thumbs Up']
search_engines = np.array(['Google', 'Bing', 'Yahoo!', 'Baidu'])
social_network = ['Facebook', 'Instagram', 'TikTok', 'Twitter']
social_network_active_users = [2700000000, 1200000000, 700000000, 20000000]

In [84]:
# Add the data from the lists emojis, search_engines, social_network, and social_network_active_users
# to a dictionary called most_popular_2021_dictionary:
#   - use the 4 variables created in the cell above to fill the data for each key
#   - each key should be a string containing the name of the corresponding variable.
most_popular_2021_dictionary = {'emojis': emojis,
                                'search_engines': search_engines,
                                'social_network': social_network,
                                'social_network_active_users': social_network_active_users}

# Create a dataframe called most_popular_2021_df
#   - set an index with the values 'first', 'second', 'third', 'fourth'
#   - use the dictionary created above to populate the dataframe.
most_popular_2021_df = pd.DataFrame(most_popular_2021_dictionary, index=['first', 'second', 'third', 'fourth'])



In [85]:
assert(isinstance(most_popular_2021_dictionary,dict)), 'Something is wrong! most_popular_2021_dictionary is not a dictionary.'
assert(isinstance(most_popular_2021_df, pd.DataFrame)), 'most_popular_2021_df is not a DataFrame'
assert(most_popular_2021_df['emojis'].tolist()==emojis), "The emojis column doesn't look right."
assert(most_popular_2021_df['search_engines'].tolist()==list(search_engines)), "The search_engines column doesn't look right."
assert(most_popular_2021_df['social_network'].tolist()==social_network), "The social_network column doesn't look right."
assert(most_popular_2021_df.shape == (4, 4)), 'The size of the dataframe is not correct.'
assert(most_popular_2021_df.index.tolist() == ['first', 'second', 'third', 'fourth']), 'The index is not correct. Reread the instructions.'

## Exercise 2: Loading DataFrames from files

### 2.1 Load a dataset into a `ds_jobs` dataframe
Let's load a dataset with data about data science job applicants. It is a subset of a Kaggle dataset available [here](https://www.kaggle.com/datasets/arashnic/hr-analytics-job-change-of-data-scientists).

In [86]:
# Load the dataset from the file located at data/ds_jobs.csv.
ds_jobs = pd.read_csv('https://raw.githubusercontent.com/samsung-ai-course/6-7-edition/refs/heads/main/Data%20Wrangling/Data%20Wrangling%20-%20Pandas-101/data/ds_jobs.csv')


In [87]:
# Print the dataframe head() to get an idea of what you've just loaded.
ds_jobs.head()

Unnamed: 0,id,g,exp,enr,ed,m,y_exp,t_job,cdi
0,32403,Male,True,Full time course,Graduate,STEM,9,1,0.827
1,9858,Female,True,no_enrollment,Graduate,STEM,5,1,0.92
2,31806,Male,False,no_enrollment,High School,,<1,never,0.624
3,27385,Male,True,no_enrollment,Masters,STEM,11,1,0.827
4,27724,Male,True,no_enrollment,Graduate,STEM,>20,>4,0.92


In [88]:
assert(isinstance(ds_jobs, pd.DataFrame)), "Something is wrong. ds_jobs does not look like a dataframe."
assert(ds_jobs.shape == (1003, 9)), "The shape is not correct. Did you follow all the instructions in the comments?"
assert(sum(ds_jobs.columns == ['id', 'g', 'exp', 'enr', 'ed', 'm', 'y_exp', 't_job', 'cdi']) == 9), "The columns don't look right."
assert(ds_jobs.id[3] == 27385 and ds_jobs.id[552] == 13748), "The id looks wrong."
assert(ds_jobs.id.max() == 33343), "Something is wrong. Did you follow all the instructions in the comments?"
assert(ds_jobs.enr[446] == 'no_enrollment'), "Something is wrong. Did you follow all the instructions in the comments?"

### 2.2 Load a dataset, but this time better

Notice that the column names in the `ds_jobs` dataframe are not very informative, which makes the data hard to read for anyone looking at it. Instead, we want to load the dataset with the following column names:
- `'id'`
- `'gender'`
- `'relevant_experience'` - whether the candidate has experience in the field
- `'enrollment_type'` - full or part time
- `'education'` - highest attained education
- `'major'` - major subject at university
- `'years_of_experience'` - years of job experience
- `'time_since_last_job'` - years passed since last job
- `'city_development_index'` - development level of home city
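
One way to do this at load time, sketched on a tiny in-memory CSV rather than the real file, is to pass `names=` together with `header=0` to `pd.read_csv`. The CSV text and the variable `toy_jobs` are hypothetical stand-ins; renaming the columns after loading with `DataFrame.rename` is an equally valid alternative.

```python
import io
import pandas as pd

# Hypothetical two-row CSV with terse header names, standing in for the real file.
csv_text = "id,g,cdi\n1,Male,0.827\n2,Female,0.92\n"

# names= supplies the new column names; header=0 tells pandas that the first
# row of the file is the old header, so it is replaced instead of read as data.
toy_jobs = pd.read_csv(io.StringIO(csv_text),
                       names=['id', 'gender', 'city_development_index'],
                       header=0)

print(toy_jobs.columns.tolist())
print(toy_jobs.shape)
```

Without `header=0`, the old header line would be read as a data row, which yields one row more than expected.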

In [89]:
# Load the file at 'data/ds_jobs.csv' into a dataframe ds_jobs.
# set the column names to 'id', 'gender', 'relevant_experience', 'enrollment_type', 'education', 'major',
# 'years_of_experience', 'time_since_last_job', 'city_development_index' in this order.
# You will need to check the documentation at
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html to see how you can do this.
ds_jobs = ds_jobs.rename(columns={'g': 'gender', 'exp': 'relevant_experience', 'enr': 'enrollment_type', 'ed': 'education', 'm': 'major', 'y_exp': 'years_of_experience', 't_job': 'time_since_last_job', 'cdi': 'city_development_index'})


In [90]:
# Print the dataframe head() to get an idea of what you've just loaded.
ds_jobs.head()

Unnamed: 0,id,gender,relevant_experience,enrollment_type,education,major,years_of_experience,time_since_last_job,city_development_index
0,32403,Male,True,Full time course,Graduate,STEM,9,1,0.827
1,9858,Female,True,no_enrollment,Graduate,STEM,5,1,0.92
2,31806,Male,False,no_enrollment,High School,,<1,never,0.624
3,27385,Male,True,no_enrollment,Masters,STEM,11,1,0.827
4,27724,Male,True,no_enrollment,Graduate,STEM,>20,>4,0.92


In [91]:
assert(isinstance(ds_jobs, pd.DataFrame)), "Something is wrong. ds_jobs does not look like a dataframe."
assert(ds_jobs.shape == (1003, 9)), "The shape is not correct. Did you follow all the instructions in the comments?"
assert(ds_jobs.shape != (1004, 9)), "Something is wrong. You have 1 more row than expected. Did you tell pandas to use the 1st row as header?"
assert(sum(ds_jobs.columns == ['id', 'gender', 'relevant_experience', 'enrollment_type', 'education',
       'major', 'years_of_experience', 'time_since_last_job',
       'city_development_index']) == 9), "Don't forget to tell pandas the new column names."
assert(ds_jobs.id[6] == 21465 and ds_jobs.id[553] == 24331), "The id column looks wrong."
assert(ds_jobs.education[5] == 'Masters'), "Something is wrong. Did you follow all the instructions in the comments?"
assert(ds_jobs.city_development_index.max() >= 0.949), "Something is wrong. Did you follow all the instructions in the comments?"
assert(ds_jobs.education[11] == 'Graduate'), "Something is wrong. Did you follow all the instructions in the comments?"

### 2.3 Preview the datatypes

In [92]:
# Store the datatypes of all columns of ds_jobs in ds_jobs_dtypes.
# Use the method you learned in the learning notebook.
ds_jobs_dtypes = ds_jobs.dtypes

# Note: if you used the correct method,
# the result will be a pandas series containing the datatypes of each column,
# with the index formed by the column names



In [93]:
# Check your output - there should be object, float, bool, and integer types.
ds_jobs_dtypes

Unnamed: 0,0
id,int64
gender,object
relevant_experience,bool
enrollment_type,object
education,object
major,object
years_of_experience,object
time_since_last_job,object
city_development_index,float64


In [94]:
assert(sum([x in ds_jobs_dtypes.index for x in ds_jobs.columns]) == 9), "The index of ds_jobs_dtypes should contain all columns in ds_jobs."
assert(hashlib.sha256(str(ds_jobs_dtypes['relevant_experience']).encode()).hexdigest() == 'b760f44fa5965c2474a3b471467a22c43185152129295af588b022ae50b50903'), "The dtype of column 'relevant_experience' is not as expected."

### 2.4 Set the correct datatypes
The datatypes in `ds_jobs` were inferred, so all text columns are stored as `object`. Convert these columns to the `string` datatype using a function you learned in the learning notebook.
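
A minimal sketch of two possible approaches on a toy frame, assuming the function in question is either `DataFrame.convert_dtypes` or `DataFrame.astype` (the learning notebook is not shown here, so this is a guess). The frame `toy` and its columns are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame with a text, a boolean, and a float column.
toy = pd.DataFrame({'name': ['Ana', 'Rui'],
                    'active': [True, False],
                    'score': [0.827, 0.92]})

# convert_dtypes() takes no target types: it picks the best nullable
# pandas extension dtype for every column.
converted = toy.convert_dtypes()

# astype() with a mapping converts only the listed columns.
only_strings = toy.astype({'name': 'string'})

print(converted.dtypes)
print(only_strings.dtypes)
```

With `convert_dtypes` every column ends up with a pandas extension dtype, while `astype` leaves the unlisted columns untouched.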

In [95]:
# Set the correct datatypes in the ds_jobs dataframe - convert the objects to strings.
# Store the new dtypes in the variable ds_jobs_dtypes_converted.
# convert_dtypes() takes no target types; it infers the best nullable pandas dtype per column.
ds_jobs = ds_jobs.convert_dtypes()
ds_jobs_dtypes_converted = ds_jobs.dtypes



In [96]:
# Check your solution and compare it to the result of exercise 2.3. There will be pandas datatypes now (all or
# some, depending on which method you used).
ds_jobs_dtypes_converted

Unnamed: 0,0
id,Int64
gender,string[python]
relevant_experience,boolean
enrollment_type,string[python]
education,string[python]
major,string[python]
years_of_experience,string[python]
time_since_last_job,string[python]
city_development_index,Float64


In [99]:
assert(sum([x in ds_jobs_dtypes_converted.index for x in ds_jobs.columns]) == 9), "The index of ds_jobs_dtypes_converted should contain all columns in ds_jobs."
assert(hashlib.sha256(str(ds_jobs_dtypes_converted['relevant_experience'])[:4].encode()).hexdigest() == 'b760f44fa5965c2474a3b471467a22c43185152129295af588b022ae50b50903'), "The dtype of column 'relevant_experience' is not as expected."
assert(hashlib.sha256(str(ds_jobs_dtypes_converted['city_development_index']).lower().encode()).hexdigest() == '6bd2a66c4467bc379fd21e11d74bfa2b0f8205baf39eefc20b2c4fecb198dd48'), "The dtype of column 'city_development_index' is not as expected."
assert(hashlib.sha256(str(ds_jobs_dtypes_converted['time_since_last_job']).encode()).hexdigest()=='473287f8298dba7163a897908958f7c0eae733e25d2e027992ea2edc9bed2fa8'), "The dtype of column 'time_since_last_job' is not as expected."

### 2.5 Get information about the dataframe size
Use a method you learned in the learning notebook to retrieve the `number of rows` and the `number of columns` in the ds_jobs dataframe.
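
For illustration, and assuming the intended tool is the `shape` attribute, a toy frame (hypothetical, not the exercise data) shows how the tuple unpacks:

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
toy = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = toy.shape

print(n_rows, n_cols)
```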



In [104]:
number_of_rows = ds_jobs.shape[0]
number_of_columns = ds_jobs.shape[1]




In [105]:
assert(hashlib.sha256(str(int(number_of_rows)).encode()).hexdigest() == '8c9a013ab70c0434313e3e881c310b9ff24aff1075255ceede3f2c239c231623'), "The number of rows is not correct."
assert(hashlib.sha256(str(int(number_of_columns)).encode()).hexdigest() == '19581e27de7ced00ff1ce50b2047e7a567c76b1cbaebabe5ef03f7c3017bb5b7'), "The number of columns is not correct."

### 2.6 Load a json file into a dataframe
Let's load a new dataframe called `hdi` from the file stored at `../data/HDI.json`. It contains human development index statistics for the years 1990-2019, a subset of a Kaggle dataset available [here](https://www.kaggle.com/datasets/elmartini/human-development-index-historical-data).
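
A small sketch of `pd.read_json` on in-memory text, assuming a "records" layout (a list with one object per row). The JSON content and the name `toy_hdi` are made up for illustration; the layout of the real file may differ.

```python
import io
import pandas as pd

# Hypothetical JSON in "records" layout; null marks a missing value.
json_text = '[{"country": "A", "hdi": 0.5}, {"country": "B", "hdi": null}]'

# Wrapping the text in StringIO lets read_json treat it like a file.
toy_hdi = pd.read_json(io.StringIO(json_text))

print(toy_hdi)
print(toy_hdi.shape)
```

JSON `null` values arrive as missing values in the dataframe, which is why some early years in the preview are empty.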

In [114]:
# Load the data file from data/HDI.json and store it in the variable hdi. Use the appropriate method for json files.
hdi = pd.read_json('/content/HDI.json')



In [115]:
# Preview your dataframe
hdi.head()

Unnamed: 0,HDI Rank,Country,1990,1995,2000,2005,2010,2015,2019
0,169,Afghanistan,0.302,0.331,0.35,0.418,0.472,0.5,0.511
1,69,Albania,0.65,0.637,0.671,0.706,0.745,0.788,0.795
2,91,Algeria,0.572,0.595,0.637,0.685,0.721,0.74,0.748
3,36,Andorra,,,0.813,0.827,0.837,0.862,0.868
4,148,Angola,,,0.4,0.46,0.517,0.572,0.581


In [116]:
assert(isinstance(hdi, pd.DataFrame)), "Something is wrong. hdi does not look like a dataframe."
assert(hdi.shape == (189, 9)), "The shape is not correct. Did you follow all the instructions in the comments?"
assert(sum(hdi.columns == ['HDI Rank', 'Country', '1990', '1995', '2000', '2005', '2010', '2015',
       '2019']) == 9), "The columns don't look right."
assert(hdi.Country[13] == 'Bangladesh' and hdi.Country[52] == 'El Salvador'), "The Country column looks wrong."
assert(sum(hdi['HDI Rank']) > 136), "Something is wrong. Did you follow all the instructions in the comments?"
assert(sum(hdi['HDI Rank']) == 17914), "Something is wrong. Did you follow all the instructions in the comments?"

### 2.7 Get a numpy array of column names
Store the names of the `columns` in the hdi dataframe as a `numpy array`.
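
A short sketch on a toy frame (hypothetical data) of the two steps involved: `columns` returns a pandas `Index`, and `to_numpy()` turns it into a NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame.
toy = pd.DataFrame({'HDI Rank': [1, 2], 'Country': ['A', 'B']})

cols = toy.columns            # a pandas Index holding the column labels
cols_array = cols.to_numpy()  # the same labels as a NumPy array

print(type(cols).__name__, type(cols_array).__name__)
```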

In [118]:
# First extract the columns into hdi_cols.
hdi_cols = hdi.columns

# Then convert the output into a NumPy array.
hdi_cols_array = hdi_cols.to_numpy()



In [119]:
# Always preview your variables to see the result of the operations.
print(hdi_cols, type(hdi_cols), "\n", sep="\n")
print(hdi_cols_array, type(hdi_cols_array), sep="\n")

Index(['HDI Rank', 'Country', '1990', '1995', '2000', '2005', '2010', '2015',
       '2019'],
      dtype='object')
<class 'pandas.core.indexes.base.Index'>


['HDI Rank' 'Country' '1990' '1995' '2000' '2005' '2010' '2015' '2019']
<class 'numpy.ndarray'>


In [120]:
assert(isinstance(hdi_cols, pd.core.indexes.base.Index)), "Use the method you learned to extract the columns into hdi_cols."
assert(len(hdi_cols) == 9), "There are 9 columns in the hdi dataframe. Did you extract them all? Also, make sure you don't change the variable hdi."
assert(isinstance(hdi_cols_array, np.ndarray)), "The hdi_cols_array does not look like a numpy array."

### 2.8 Extract the index as a numpy array
Do the same as in exercise 2.7, but now for the index of `hdi`.
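
As a sketch on a toy frame (hypothetical data): a dataframe created without an explicit index gets a `RangeIndex` counting from 0, which is a subclass of `pd.Index`. Its values therefore sum to `n * (n - 1) / 2` for `n` rows.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 4 rows and the default index.
toy = pd.DataFrame({'x': [10, 20, 30, 40]})

idx = toy.index             # RangeIndex running from 0 to 3
idx_array = idx.to_numpy()  # the same values as a NumPy array

n = len(idx)
print(idx_array.sum(), n * (n - 1) // 2)  # both are the sum 0 + 1 + ... + (n - 1)
```

For a default index over 189 rows, the same formula gives 189 * 188 / 2 = 17766.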

In [122]:
# Extract the index using the method you learned.
hdi_index = hdi.index

# Convert it to a numpy array.
hdi_index_array = hdi_index.to_numpy()



In [123]:
assert(isinstance(hdi_index, pd.core.indexes.base.Index)), "Use the method you learned to extract the index into hdi_index."
assert(len(hdi_index) == 189), "The length of the hdi_index variable is incorrect."
assert(sum(hdi_index_array) == 17766), "Something is wrong with the index array."
assert(isinstance(hdi_index_array, np.ndarray)), "The hdi_index_array does not look like a numpy array."

### 2.9 Describe the data in your dataframe
Last but not least, remember how you can get some stats and info on your dataframe? If you don't, make sure to reread the learning notebook. If you do, let's jump to this final exercise.

Using only the two methods you learned for getting information and statistics on a dataframe, answer the three questions in the cell below manually.
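
A minimal sketch, assuming the two methods meant are `describe()` and `info()`, on a toy frame with one missing value (the data and the column name `hdi_1990` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one text column and one numeric column with a gap.
toy = pd.DataFrame({'country': ['A', 'B', 'C'],
                    'hdi_1990': [0.5, np.nan, 0.7]})

stats = toy.describe()  # summary statistics for numeric columns only
toy.info()              # prints dtypes and non-null counts per column

print(stats.loc['count', 'hdi_1990'])  # number of non-null values
print(stats.loc['mean', 'hdi_1990'])   # mean, ignoring missing values
```

The `count` row of `describe()` reports non-null entries, so it answers "how many values are present" without any extra filtering.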

In [124]:
# Use this draft cell to print stuff to help you answer the questions below.
hdi.describe()

Unnamed: 0,HDI Rank,1990,1995,2000,2005,2015,2019
count,189.0,144.0,148.0,174.0,185.0,188.0,189.0
mean,94.783069,0.599653,0.615669,0.632776,0.659032,0.710511,0.722423
std,54.754486,0.16555,0.168708,0.169884,0.165277,0.151868,0.149791
min,1.0,0.22,0.231,0.262,0.294,0.372,0.394
25%,48.0,0.48025,0.48,0.47575,0.512,0.586,0.602
50%,95.0,0.628,0.6375,0.663,0.689,0.7375,0.74
75%,142.0,0.732,0.749,0.77175,0.791,0.8245,0.829
max,189.0,0.871,0.888,0.915,0.931,0.947,0.957


In [133]:
# Question 1
# What is the mean value for HDI in the year 2019 (rounded to 2 decimal places)?
mean_HDI_2019 = round(hdi.describe().loc['mean', '2019'], 2)

# Question 2
# What is the maximum value for HDI in the year 1995 (rounded to 2 decimal places)?
max_HDI_1995 = round(hdi.describe().loc['max', '1995'], 2)

# Question 3
# How many non-null entries do we have for the year 1990? Store the answer as an integer.
nonnull_HDI_1990 = hdi.loc[hdi['1990'].notnull()].shape[0]


In [134]:
assert (isinstance(mean_HDI_2019, float)), "mean_HDI_2019 should be a float."
assert (isinstance(max_HDI_1995, float)), "max_HDI_1995 should be a float."
assert (isinstance(nonnull_HDI_1990, int)), "nonnull_HDI_1990 should be an integer."
np.testing.assert_almost_equal(float(mean_HDI_2019), 0.72, 2, err_msg="mean_HDI_2019 does not look right.")
np.testing.assert_almost_equal(float(max_HDI_1995), 0.89, 2, err_msg="max_HDI_1995 does not look right.")
assert(hashlib.sha256(str(int(nonnull_HDI_1990)).encode()).hexdigest() == '5ec1a0c99d428601ce42b407ae9c675e0836a8ba591c8ca6e2a2cf5563d97ff0'), "nonnull_HDI_1990 does not look right."