# **IMPORT DATA SET** 

### **Objectives** 

 * Acquire data in various ways
 
 * Obtain insights from data with the Pandas library


**Table of Contents**

- Data Acquisition

- Basic Insights from the Data Set

**Data Acquisition**

A data set is typically a file containing data stored in one of several formats. Common file formats for data sets include .csv, .json, and .xlsx. The data set can be stored in different places: on your local machine, on a server or a website, in cloud storage, and so on.
To analyse data in a Python notebook, we need to bring the data set into the notebook. In this section, you will learn how to load a data set into a Jupyter Notebook.

In our case, the laptop pricing data set is an online source, and it is in CSV (comma-separated values) format. Let's use this data set as an example to practice reading data.

 - Data source: https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_base.csv
            
 - Data type: csv
    
The Pandas library is a popular tool for reading data sets in various formats into a data frame. Our Jupyter notebook platforms have Pandas built in, so all we need to do is import it; no installation is required.

In [1]:
import pandas as pd
import numpy as np

Note: This JupyterLite environment requires the dataset to be downloaded to the interface. If you are working on a downloaded version of this notebook on your local machine, you can simply use the URL directly in the pandas.read_csv() function.

The code below loads the dataset from its URL:

In [6]:
filepath = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_base.csv"
df = pd.read_csv(filepath, header=None)

Task #1: 
Load the dataset to a pandas dataframe named 'df'
Print the first 5 entries of the dataset to confirm loading.

In [8]:
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-Coursera/laptop_pricing_dataset_base.csv", header=None)
print(df.head())

     0   1          2   3   4   5       6    7   8    9     10    11
0  Acer   4  IPS Panel   2   1   5   35.56  1.6   8  256   1.6   978
1  Dell   3    Full HD   1   1   3  39.624  2.0   4  256   2.2   634
2  Dell   3    Full HD   1   1   7  39.624  2.7   8  256   2.2   946
3  Dell   4  IPS Panel   2   1   5  33.782  1.6   8  128  1.22  1244
4    HP   4    Full HD   2   1   7  39.624  1.8   8  256  1.91   837


### Add Headers

Take a look at the data set. Because the file was read with header=None, Pandas automatically labelled the columns with integers starting from 0.

To better describe the data, you can introduce a header. The file itself contains no header row, and the column names for this data set are listed in Task #2 below.

Thus, you have to add headers manually.

First, create a list "headers" that includes all column names in order. Then, use dataframe.columns = headers to replace the headers with the list you created.
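
The two steps above can be sketched on a tiny in-memory CSV. The sample rows and the names `sample_csv`, `sample` and `named` are made up for illustration and are not part of the laptop data set. As an alternative, the sketch also passes the names at read time through the `names` parameter of `pd.read_csv()`, which should give the same result.

```python
import io
import pandas as pd

# Hypothetical three-column sample without a header row (illustration only).
sample_csv = "Acer,4,978\nDell,3,634\n"

# header=None tells pandas that the first line is data, not column names,
# so the columns are labelled 0, 1, 2, ...
sample = pd.read_csv(io.StringIO(sample_csv), header=None)
default_labels = list(sample.columns)

# Option 1: replace the integer labels after loading.
headers = ["Manufacturer", "Category", "Price"]
sample.columns = headers

# Option 2: supply the column names while reading.
named = pd.read_csv(io.StringIO(sample_csv), header=None, names=headers)

print(default_labels)
print(sample.equals(named))
```

Both options leave the data untouched; they differ only in when the labels are attached.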

Task #2: 
Add headers to the dataframe
The headers for the dataset, in sequence, are "Manufacturer", "Category", "Screen", "GPU", "OS", "CPU_core", "Screen_Size_inch", "CPU_frequency", "RAM_GB", "Storage_GB_SSD", "Weight_kg" and "Price".
Confirm insertion by printing the first 10 rows of the dataset.

In [9]:
headers = ["Manufacturer", "Category", "Screen", "GPU", "OS", "CPU_core", "Screen_Size_inch", "CPU_frequency", "RAM_GB", "Storage_GB_SSD", "Weight_kg", "Price"]
df.columns = headers
print(df.head(10))

  Manufacturer  Category     Screen  GPU  OS  CPU_core Screen_Size_inch  \
0         Acer         4  IPS Panel    2   1         5            35.56   
1         Dell         3    Full HD    1   1         3           39.624   
2         Dell         3    Full HD    1   1         7           39.624   
3         Dell         4  IPS Panel    2   1         5           33.782   
4           HP         4    Full HD    2   1         7           39.624   
5         Dell         3    Full HD    1   1         5           39.624   
6           HP         3    Full HD    3   1         5           39.624   
7         Acer         3  IPS Panel    2   1         5             38.1   
8         Dell         3    Full HD    1   1         5           39.624   
9         Acer         3  IPS Panel    3   1         7             38.1   

   CPU_frequency  RAM_GB  Storage_GB_SSD Weight_kg  Price  
0            1.6       8             256       1.6    978  
1            2.0       4             256       2.2    

## Task 3:

Replace '?' with 'NaN'

Replace the '?' entries in the dataset with the NaN value provided by the NumPy package.

In [10]:
df.replace('?', np.nan, inplace=True)
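
To check that a replacement like this worked, one option is to count the missing entries per column afterwards. The sketch below uses a small made-up frame (`sample` is hypothetical, not the laptop data) and the non-in-place form of `replace()`, which returns a new data frame instead of modifying the original.

```python
import numpy as np
import pandas as pd

# Hypothetical sample in which '?' marks a missing value (illustration only).
sample = pd.DataFrame({
    "Screen_Size_inch": ["35.56", "?", "39.624"],
    "Weight_kg": ["1.6", "2.2", "?"],
})

# Without inplace=True, replace() returns a new DataFrame
# and leaves `sample` unchanged.
cleaned = sample.replace("?", np.nan)

# isna() marks missing cells; summing counts them per column.
missing_per_column = cleaned.isna().sum()
print(missing_per_column)
```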

## Save Dataset

Pandas also enables you to save the data set to CSV. Call the dataframe.to_csv() method and pass the file path and name, in quotation marks, inside the parentheses.

For example, if you save the data frame df as automobile.csv to your local machine, you may use the syntax below, where index = False means the row names will not be written.

df.to_csv("automobile.csv", index=False)

In [None]:
df.to_csv("automobile.csv", index=False)
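
The effect of index=False can be seen in a short round trip. This is a minimal sketch that writes a made-up two-row frame to a temporary directory; the file name `sample.csv` and the variable names are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical sample data (illustration only).
sample = pd.DataFrame({"Manufacturer": ["Acer", "Dell"], "Price": [978, 634]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")

    # index=False keeps the row labels (0, 1, ...) out of the file.
    sample.to_csv(path, index=False)

    # The first line of the file holds only the column names.
    with open(path) as f:
        first_line = f.readline().strip()

    # Reading the file back should reproduce the original frame.
    restored = pd.read_csv(path)

print(first_line)
print(restored.equals(sample))
```

If index=False were omitted, the file would gain an extra unnamed first column holding the row labels.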

You can also read and save other file formats. You can use similar functions like **pd.read_csv() and df.to_csv()** for other data formats. The functions are listed in the following table:

## Read/Save Other Data Formats

| Data Format | Read | Save |
| --- | --- | --- |
| csv | pd.read_csv() | df.to_csv() |
| json | pd.read_json() | df.to_json() |
| excel | pd.read_excel() | df.to_excel() |
| hdf | pd.read_hdf() | df.to_hdf() |
| sql | pd.read_sql() | df.to_sql() |
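
As one illustration of the pairs listed above, the following sketch round-trips a small made-up frame through JSON in memory. The sample data and variable names are hypothetical, and `orient="records"` is just one of several layouts that `to_json()` supports.

```python
import io
import pandas as pd

# Hypothetical sample data (illustration only).
sample = pd.DataFrame({"Manufacturer": ["Acer", "Dell"], "Price": [978, 634]})

# With no path argument, to_json() returns the JSON text as a string.
json_text = sample.to_json(orient="records")

# read_json() accepts a file-like object, so the string is wrapped in StringIO.
restored = pd.read_json(io.StringIO(json_text), orient="records")

print(json_text)
print(restored)
```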
                                                                            


## Data Types

Data has a variety of types.

The main types stored in Pandas data frames are object, float, int, bool and datetime64. To better understand each attribute, you should always know the data type of each column. In Pandas, the dtypes attribute returns the data type of each column.

## Task 4: 

Print the data types of the dataframe columns
Make a note of the data types of the different columns of the dataset.

In [11]:
print(df.dtypes)

Manufacturer         object
Category              int64
Screen               object
GPU                   int64
OS                    int64
CPU_core              int64
Screen_Size_inch     object
CPU_frequency       float64
RAM_GB                int64
Storage_GB_SSD        int64
Weight_kg            object
Price                 int64
dtype: object
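
Columns such as Screen_Size_inch and Weight_kg are reported as object even though they hold numbers, most likely because the '?' placeholders made Pandas read them as text. The sketch below, on a made-up column, shows one possible way to convert such a column to a numeric type once the placeholders have been replaced. It is an illustration under that assumption, not a step this notebook performs.

```python
import numpy as np
import pandas as pd

# Hypothetical column read as text because of a '?' entry (illustration only).
sample = pd.DataFrame({"Weight_kg": ["1.6", "?", "2.2"]})

# Replacing '?' with NaN does not change the column type by itself.
sample = sample.replace("?", np.nan)
dtype_before = sample["Weight_kg"].dtype  # still a text dtype, not numeric

# astype(float) parses the remaining strings; NaN stays NaN.
sample["Weight_kg"] = sample["Weight_kg"].astype(float)
dtype_after = sample["Weight_kg"].dtype

print(dtype_before, dtype_after)
```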


## Describe

To get a statistical summary of each column, such as the count, mean and standard deviation, use the describe method:

## Task 5:  

Print the statistical description of the dataset using the describe method:

In [12]:
print(df.describe(include='all'))

       Manufacturer    Category   Screen         GPU          OS    CPU_core  \
count           238  238.000000      238  238.000000  238.000000  238.000000   
unique           11         NaN        2         NaN         NaN         NaN   
top            Dell         NaN  Full HD         NaN         NaN         NaN   
freq             71         NaN      161         NaN         NaN         NaN   
mean            NaN    3.205882      NaN    2.151261    1.058824    5.630252   
std             NaN    0.776533      NaN    0.638282    0.235790    1.241787   
min             NaN    1.000000      NaN    1.000000    1.000000    3.000000   
25%             NaN    3.000000      NaN    2.000000    1.000000    5.000000   
50%             NaN    3.000000      NaN    2.000000    1.000000    5.000000   
75%             NaN    4.000000      NaN    3.000000    1.000000    7.000000   
max             NaN    5.000000      NaN    3.000000    2.000000    7.000000   

       Screen_Size_inch  CPU_frequency 
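
The difference between the default call and include='all' can be seen on a small made-up frame. In this sketch (sample data and variable names are hypothetical), the default summary covers only the numeric column, while include='all' adds the text column together with the unique, top and freq rows.

```python
import pandas as pd

# Hypothetical sample with one text and one numeric column (illustration only).
sample = pd.DataFrame({
    "Manufacturer": ["Acer", "Dell", "Dell"],
    "Price": [978, 634, 946],
})

# Default: only numeric columns are summarised.
numeric_summary = sample.describe()

# include="all": every column is summarised; statistics that do not apply
# to a column (for example the mean of a text column) are filled with NaN.
full_summary = sample.describe(include="all")

print(list(numeric_summary.columns))
print(full_summary.loc["top", "Manufacturer"])
```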

## Task 6:

Print the summary information of the dataset.

In [13]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 238 entries, 0 to 237
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Manufacturer      238 non-null    object 
 1   Category          238 non-null    int64  
 2   Screen            238 non-null    object 
 3   GPU               238 non-null    int64  
 4   OS                238 non-null    int64  
 5   CPU_core          238 non-null    int64  
 6   Screen_Size_inch  234 non-null    object 
 7   CPU_frequency     238 non-null    float64
 8   RAM_GB            238 non-null    int64  
 9   Storage_GB_SSD    238 non-null    int64  
 10  Weight_kg         233 non-null    object 
 11  Price             238 non-null    int64  
dtypes: float64(1), int64(7), object(4)
memory usage: 22.4+ KB
None
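
The info() method writes its report directly to the output and returns None, which is why printing its return value shows a trailing None. If the report is needed as text, one option is the `buf` parameter, as in this minimal sketch on a made-up frame (sample data and variable names are hypothetical).

```python
import io
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value (illustration only).
sample = pd.DataFrame({
    "Manufacturer": ["Acer", "Dell", "HP"],
    "Weight_kg": [1.6, np.nan, 1.91],
})

# Passing buf= sends the report to the buffer instead of standard output.
buffer = io.StringIO()
result = sample.info(buf=buffer)  # the return value is None
report = buffer.getvalue()

# The non-null counts shown in the report can also be computed directly.
non_null_counts = sample.notna().sum()

print(report)
print(non_null_counts)
```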
