# Importing Data Sets II

## Setup

<p>We will use the following libraries:</p>
<ul>
    <li><code>requests</code> for downloading the dataset.</li>
    <li><code>os</code> for building file paths and creating the data directory.</li>
    <li><code>pandas</code> for managing the data.</li>
    <li><code>numpy</code> for mathematical operations.</li>
</ul>

### Importing Required Libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import os

In [2]:
dir_path = os.path.join(".", "data")
os.makedirs(dir_path, exist_ok=True)

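def download_file(url: str) -> str:
    """Stream the file at `url` into `dir_path` and return the local file path."""
    # A timeout keeps the call from hanging indefinitely if the server stops responding.
    with requests.get(url=url, stream=True, timeout=30) as response:
        response.raise_for_status()

        # Name the local file after the last segment of the URL.
        filepath = os.path.join(dir_path, url.rsplit("/", 1)[-1])
        # Content-Length may be absent; 0 disables the progress report below.
        total_size = int(response.headers.get("Content-Length", 0))
        chunk_size = 1024 ** 2  # read up to 1 MiB per chunk
        download_size = 0

        with open(filepath, "wb") as file:
            for chunk in response.iter_content(chunk_size=chunk_size):
                if not chunk:
                    continue

                file.write(chunk)
                download_size += len(chunk)

                if total_size > 0:
                    progress = (download_size / total_size) * 100
                    print(f"Downloading: {progress:.2f}% ({download_size} / {total_size} bytes)")

    print("Download Complete.")
    return filepath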

<p>To obtain the dataset, call the <code>download_file()</code> function defined above with the dataset URL. It returns the path of the saved file.</p>

In [3]:
data_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_base.csv"
filename = download_file(data_url)

Downloading: 100.00% (11512 / 11512 bytes)
Download Complete.


## Task 1 - Load the dataset to a pandas dataframe named "df"

<p>Print the first 5 entries of the dataset to confirm loading.</p>

In [4]:
df = pd.read_csv(filename, header=None)
df.head()

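| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Acer | 4 | IPS Panel | 2 | 1 | 5 | 35.56 | 1.6 | 8 | 256 | 1.6 | 978 |
| 1 | Dell | 3 | Full HD | 1 | 1 | 3 | 39.624 | 2.0 | 4 | 256 | 2.2 | 634 |
| 2 | Dell | 3 | Full HD | 1 | 1 | 7 | 39.624 | 2.7 | 8 | 256 | 2.2 | 946 |
| 3 | Dell | 4 | IPS Panel | 2 | 1 | 5 | 33.782 | 1.6 | 8 | 128 | 1.22 | 1244 |
| 4 | HP | 4 | Full HD | 2 | 1 | 7 | 39.624 | 1.8 | 8 | 256 | 1.91 | 837 |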
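<p>The file has no header row, which is why <code>header=None</code> is passed above. The following is a minimal sketch of the difference, using a made-up two-row sample (<code>sample_csv</code> is a hypothetical stand-in, not the real dataset):</p>

```python
import io

import pandas as pd

# Hypothetical two-row sample shaped like the laptop file: data only, no header row.
sample_csv = "Acer,4,IPS Panel,978\nDell,3,Full HD,634\n"

# header=None keeps every line as data and numbers the columns 0, 1, 2, ...
with_header_none = pd.read_csv(io.StringIO(sample_csv), header=None)

# The default (header="infer") treats the first line as column names,
# so the first record would be lost as data.
default_header = pd.read_csv(io.StringIO(sample_csv))

print(with_header_none.shape)
print(default_header.shape)
print(list(with_header_none.columns))
```

<p>With the default setting, the first laptop record would silently become the column labels, so the row count would be one short.</p>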


## Task 2 - Add headers to the dataframe

<p>The headers for the dataset, in sequence, are "Manufacturer", "Category", "Screen", "GPU", "OS", "CPU_core", "Screen_Size_inch", "CPU_frequency", "RAM_GB", "Storage_GB_SSD", "Weight_kg" and "Price".</p>
<p>Confirm insertion by printing the first 10 rows of the dataset.</p>

In [5]:
headers = [
    "Manufacturer", "Category", "Screen", "GPU", "OS", "CPU_core", "Screen_Size_inch", "CPU_frequency", "RAM_GB", "Storage_GB_SSD", "Weight_kg", "Price"
]
df.columns = headers
df.head(10)

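| | Manufacturer | Category | Screen | GPU | OS | CPU_core | Screen_Size_inch | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_kg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Acer | 4 | IPS Panel | 2 | 1 | 5 | 35.56 | 1.6 | 8 | 256 | 1.6 | 978 |
| 1 | Dell | 3 | Full HD | 1 | 1 | 3 | 39.624 | 2.0 | 4 | 256 | 2.2 | 634 |
| 2 | Dell | 3 | Full HD | 1 | 1 | 7 | 39.624 | 2.7 | 8 | 256 | 2.2 | 946 |
| 3 | Dell | 4 | IPS Panel | 2 | 1 | 5 | 33.782 | 1.6 | 8 | 128 | 1.22 | 1244 |
| 4 | HP | 4 | Full HD | 2 | 1 | 7 | 39.624 | 1.8 | 8 | 256 | 1.91 | 837 |
| 5 | Dell | 3 | Full HD | 1 | 1 | 5 | 39.624 | 1.6 | 8 | 256 | 2.2 | 1016 |
| 6 | HP | 3 | Full HD | 3 | 1 | 5 | 39.624 | 1.6 | 8 | 256 | 2.1 | 1117 |
| 7 | Acer | 3 | IPS Panel | 2 | 1 | 5 | 38.1 | 1.6 | 4 | 256 | 2.2 | 866 |
| 8 | Dell | 3 | Full HD | 1 | 1 | 5 | 39.624 | 2.5 | 4 | 256 | 2.3 | 812 |
| 9 | Acer | 3 | IPS Panel | 3 | 1 | 7 | 38.1 | 1.8 | 8 | 256 | 2.2 | 1068 |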
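<p>As an alternative to assigning <code>df.columns</code> after loading, the column names can be supplied while reading the file through the <code>names</code> parameter of <code>pd.read_csv</code>. The sketch below assumes a shortened, made-up sample (<code>sample_csv</code> and <code>sample_headers</code> are hypothetical) rather than the full twelve-column dataset:</p>

```python
import io

import pandas as pd

# Hypothetical three-column sample; the real file has twelve columns.
sample_csv = "Acer,4,978\nDell,3,634\n"
sample_headers = ["Manufacturer", "Category", "Price"]

# names= labels the columns at read time; header=None states that the
# file itself contains no header row.
named_df = pd.read_csv(io.StringIO(sample_csv), header=None, names=sample_headers)

print(named_df.head())
```

<p>Either approach yields the same labels. Assigning <code>df.columns</code> requires the list to have exactly one name per column, otherwise pandas raises a <code>ValueError</code>.</p>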


## Task 3 - Replace `?` with `NaN`

<p>Replace the <code>?</code> entries in the dataset with <code>NaN</code>, which NumPy provides as <code>np.nan</code>, so that pandas treats them as missing values.</p>

In [6]:
df.replace("?", np.nan, inplace=True)
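<p>To check that a replacement like the one above worked, the missing values can be counted per column. This is a minimal sketch on a made-up three-row frame (<code>toy</code> is hypothetical, not the laptop dataset):</p>

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which "?" marks one missing weight.
toy = pd.DataFrame({"Weight_kg": ["1.6", "?", "2.2"], "Price": [978, 634, 946]})

# Without inplace=True, replace() returns a new frame and leaves `toy` unchanged.
cleaned = toy.replace("?", np.nan)

# isna() flags missing cells; sum() counts them column by column.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```

<p>A nonzero count identifies a column that contained placeholder entries and may need further cleaning.</p>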

## Task 4 - Print the data types of the dataframe columns

<p>Make a note of the data types of the different columns of the dataset.</p>

In [7]:
df.dtypes

Manufacturer         object
Category              int64
Screen               object
GPU                   int64
OS                    int64
CPU_core              int64
Screen_Size_inch     object
CPU_frequency       float64
RAM_GB                int64
Storage_GB_SSD        int64
Weight_kg            object
Price                 int64
dtype: object
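<p>Columns that held <code>?</code> entries were read as text, which is the likely reason <code>Screen_Size_inch</code> and <code>Weight_kg</code> show up as <code>object</code> even though they contain numbers. One possible follow-up, sketched here on a made-up column (<code>toy</code> is hypothetical), is to cast such a column to a floating-point type once the placeholders are <code>NaN</code>:</p>

```python
import numpy as np
import pandas as pd

# Hypothetical column: numeric strings plus one missing value,
# as left behind after replacing "?" with NaN.
toy = pd.DataFrame({"Screen_Size_inch": ["35.56", np.nan, "39.624"]})

# astype("float64") parses the strings as numbers and keeps NaN as NaN.
toy["Screen_Size_inch"] = toy["Screen_Size_inch"].astype("float64")

print(toy.dtypes)
```

<p>This conversion is not part of the tasks in this notebook; it only illustrates why the data types are worth noting.</p>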

## Task 5 - Print the statistical description of the dataset

<p>Print the statistical description of the dataset, including columns of the <code>object</code> data type.</p>

In [8]:
df.describe(include="all")

| | Manufacturer | Category | Screen | GPU | OS | CPU_core | Screen_Size_inch | CPU_frequency | RAM_GB | Storage_GB_SSD | Weight_kg | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 238 | 238.0 | 238 | 238.0 | 238.0 | 238.0 | 234.0 | 238.0 | 238.0 | 238.0 | 233.0 | 238.0 |
| unique | 11 | | 2 | | | | 9.0 | | | | 77.0 | |
| top | Dell | | Full HD | | | | 39.624 | | | | 2.2 | |
| freq | 71 | | 161 | | | | 89.0 | | | | 21.0 | |
| mean | | 3.205882 | | 2.151261 | 1.058824 | 5.630252 | | 2.360084 | 7.882353 | 245.781513 | | 1462.344538 |
| std | | 0.776533 | | 0.638282 | 0.23579 | 1.241787 | | 0.411393 | 2.482603 | 34.765316 | | 574.607699 |
| min | | 1.0 | | 1.0 | 1.0 | 3.0 | | 1.2 | 4.0 | 128.0 | | 527.0 |
| 25% | | 3.0 | | 2.0 | 1.0 | 5.0 | | 2.0 | 8.0 | 256.0 | | 1066.5 |
| 50% | | 3.0 | | 2.0 | 1.0 | 5.0 | | 2.5 | 8.0 | 256.0 | | 1333.0 |
| 75% | | 4.0 | | 3.0 | 1.0 | 7.0 | | 2.7 | 8.0 | 256.0 | | 1777.0 |
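<p>By default, <code>describe()</code> summarizes only numeric columns, and <code>include="all"</code> adds the remaining ones. The sketch below shows the difference on a made-up two-column frame (<code>toy</code> is hypothetical):</p>

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
toy = pd.DataFrame({"Manufacturer": ["Acer", "Dell", "Dell"], "Price": [978, 634, 946]})

# Default: only numeric columns are described.
numeric_only = toy.describe()

# include="all": text columns gain unique/top/freq rows, and statistics
# that do not apply to a column are left as NaN.
everything = toy.describe(include="all")

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
print(everything.loc[["unique", "top", "freq"], "Manufacturer"])
```

<p>The <code>unique</code>, <code>top</code> and <code>freq</code> rows apply to text columns, while <code>mean</code>, <code>std</code> and the quantiles apply to numeric ones, which explains the empty cells in a combined summary.</p>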


## Task 6 - Print the summary information of the dataset

<p>Print the summary information of the dataset.</p>

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238 entries, 0 to 237
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Manufacturer      238 non-null    object 
 1   Category          238 non-null    int64  
 2   Screen            238 non-null    object 
 3   GPU               238 non-null    int64  
 4   OS                238 non-null    int64  
 5   CPU_core          238 non-null    int64  
 6   Screen_Size_inch  234 non-null    object 
 7   CPU_frequency     238 non-null    float64
 8   RAM_GB            238 non-null    int64  
 9   Storage_GB_SSD    238 non-null    int64  
 10  Weight_kg         233 non-null    object 
 11  Price             238 non-null    int64  
dtypes: float64(1), int64(7), object(4)
memory usage: 22.4+ KB

