In [1]:
# Python Packages for Data Science:
# A Python library is a collection of functions and methods that lets you perform many tasks without writing
# the underlying code yourself.
# Libraries usually contain built-in modules providing different functionality, which you can use directly.
# Some libraries are extensive and offer a broad range of facilities.
# We can divide the Python data analysis libraries into three groups:

In [2]:
# I. Scientific computing libraries:
# - Pandas: It offers data structures and tools for effective data manipulation and analysis. It provides fast access
#           to structured data. The primary instrument of Pandas is the DataFrame, a two-dimensional table with column
#           and row labels. It is designed to provide easy indexing functionality.
# - NumPy : It uses arrays for its inputs and outputs. It can be extended to objects for matrices, and with minor coding
#           changes, developers can perform fast array processing.
# - SciPy : It includes functions for advanced math problems such as integrals, differential equations and
#           optimization.
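
# A minimal sketch of the two core ideas above: NumPy works on whole arrays at once, and a pandas DataFrame
# is a table addressed by row and column labels. The values, the labels "a", "b", "c" and the variable names
# are made up for illustration.
```python
import numpy as np
import pandas as pd

# NumPy: element-wise arithmetic on a whole array, with no explicit loop.
weights_kg = np.array([1000.0, 1200.0, 1500.0])  # made-up values
weights_lb = weights_kg * 2.20462

# pandas: a DataFrame is a two-dimensional table with row and column labels.
cars = pd.DataFrame(
    {"make": ["audi", "bmw", "volvo"], "weight_kg": weights_kg},
    index=["a", "b", "c"],  # hypothetical row labels
)

# Label-based lookup: row "a", column "weight_kg".
audi_weight = cars.loc["a", "weight_kg"]

print(weights_lb)
print(audi_weight)
```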

In [3]:
# II. Visualization libraries:
# Data visualization is an effective way to communicate the meaningful results of an analysis to others.
# These libraries enable you to create graphs, charts and maps.
# - Matplotlib - It is the most well-known library for data visualization. It is great for making graphs and plots. 
#                The graphs are also highly customizable.
# - Seaborn    - It is based on Matplotlib. It makes it easy to generate various plots such as heat maps, time series,
#                and violin plots.

In [6]:
# III. Algorithmic libraries:
# With Machine Learning algorithms, we're able to develop a model using our dataset, and obtain predictions.
# The algorithmic libraries tackle some machine learning tasks from basic to complex.
# - Scikit-learn : It contains tools for statistical modeling, including regression, classification, clustering and so on.
#                  This library is built on NumPy, SciPy and Matplotlib.
# - StatsModels  : It allows users to explore data, estimate statistical models, and perform statistical tests.

In [8]:
import pandas as pd

In [15]:
# Importing Data:
# Data acquisition is the process of loading and reading data into the notebook from various sources.
# To read any data using Python’s pandas package, there are two important factors to consider:
# Format - It is the way data is encoded. We can usually tell the format by looking at the extension of
# the file name. Some common formats are csv, json, xlsx, hdf and so forth.
# File path - It tells us where the data is stored, usually either on the computer we are using or online.

# Each row is one data point.

# In pandas, the “read_csv()” method can read in files with columns separated by commas into a pandas DataFrame.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data" 
# The path of the file in the local system can also be given in the same way.
# By default, read_csv() treats the first row of the file as a header. Our data does not have one, so we pass header=None.
df = pd.read_csv(url, header=None)

# dataframe.head(n) shows the first n rows of the data frame (5 by default).
# dataframe.tail(n) shows the last n rows of the data frame (5 by default).
df.head()

# It is always better to have column names for our data. In our case, the column names are in a separate place online. 
# https://archive.ics.uci.edu/ml/datasets/Automobile
# So, we first store the column names as a list.
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
# To set these headers as column names.
df.columns = headers
df.head() # we can now see the column names.

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
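
# The effect of header=None can be checked without downloading anything. The sketch below reads a small
# made-up, three-column sample from memory (it is not the real file), and also shows the names= argument
# as an alternative to assigning df.columns afterwards.
```python
import io
import pandas as pd

# Hypothetical sample with no header row.
raw = "3,alfa-romero,gas\n1,alfa-romero,gas\n2,audi,gas\n"

# Default behaviour: the first data row is consumed as the column names.
wrong = pd.read_csv(io.StringIO(raw))

# header=None keeps every row as data; names= assigns the labels while reading.
right = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["symboling", "make", "fuel-type"],
)

print(wrong.shape)    # one row fewer, because it became the header
print(right.shape)
print(right.tail(2))  # the last two rows
```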


In [30]:
# Exporting Data:
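# We can export the pandas dataframe to a new CSV file using the "to_csv()" method.
# Called without a path, to_csv() only returns the CSV text as a string, so we pass a file name
# ("automobile.csv" is just an example name).
df.to_csv("automobile.csv")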
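
# A small round-trip sketch of exporting and re-reading a CSV file. The sample values and the file name
# "sample.csv" are made up, and a temporary directory is used so nothing is left on disk.
```python
import os
import tempfile
import pandas as pd

sample = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})  # made-up sample

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.csv")  # hypothetical file name
    sample.to_csv(path, index=False)            # index=False leaves out the row labels
    restored = pd.read_csv(path)                # read the file back in

# With no path argument, to_csv() returns the CSV text instead of writing a file.
csv_text = sample.to_csv(index=False)

print(restored)
print(csv_text)
```
# Passing index=False is a choice made here so that the file re-reads with the same columns; by default
# to_csv() also writes the row labels as the first column.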

In [35]:
# Exploring the data:
# To explore the dataset, pandas provides us with different methods.
# Using these methods gives an overview of the dataset and points out potential issues, such as the wrong datatype
# of features, which may need to be resolved later on.

# Data has a variety of types. The main types stored in Pandas objects are object, float, int, and datetime.
# The datatype names are somewhat different from those in native Python.
# +---------------+--------------------+--------------------+
# |  Pandas Type  | Native Python type |     Description    |
# +---------------+--------------------+--------------------+
# |     object    |         str        | text or mixed data |
# +---------------+--------------------+--------------------+
# |     int64     |         int        |    whole numbers   |
# +---------------+--------------------+--------------------+
# |    float64    |        float       |    numbers with    |
# |               |                    |   a decimal part   |
# +---------------+--------------------+--------------------+
# |  datetime64,  |         N/A        |      time data     |
# | timedelta[ns] |                    |                    |
# +---------------+--------------------+--------------------+

# There are two reasons to check data types in a dataset:
#     > Pandas automatically assigns types based on the values it reads from the original data table.
#     For a number of reasons, this assignment may be incorrect.
#     > It allows an experienced data scientist to see which Python functions can be applied to a specific column.

# The datatype of each column can be found using the dtypes attribute.
# The datatypes are returned as a Series indexed by column name.
df.dtypes
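
# One way an automatically assigned type can be wrong: a placeholder such as "?" in a numeric column makes
# pandas read the whole column as text. The sketch below uses a made-up sample and pd.to_numeric with
# errors="coerce", which is one possible fix, not necessarily the one used later in this notebook.
```python
import pandas as pd

# Made-up sample: "?" marks a missing price.
sample = pd.DataFrame({"price": ["13495", "?", "16500"], "city-mpg": [21, 24, 19]})

dtype_before = sample["price"].dtype  # a text dtype, not a numeric one
print(sample.dtypes)

# errors="coerce" turns entries that cannot be parsed as numbers into NaN.
sample["price"] = pd.to_numeric(sample["price"], errors="coerce")

dtype_after = sample["price"].dtype
print(sample.dtypes)
```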

# We can check the statistical summary of each column using describe().
# This helps us to learn about the distribution of data in each column. The statistical metrics can tell the data scientist
# if there are mathematical issues that may exist, such as extreme outliers and large deviations.
df.describe()
# By default, describe() only summarizes numeric columns and skips columns of other types.
# It is possible to make the describe method work for object-type columns as well.
df.describe(include="all")
# We see that for object-type columns, a different set of statistics is evaluated:
# > unique - the number of distinct objects in the column
# > top - the most frequently occurring object
# > freq - the number of times the top object appears in the column.

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,205,205,205,205,205,205,205,205,205.0,...,205.0,205,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205
unique,,52,22,2,2,3,5,3,2,,...,,8,39.0,37.0,,60.0,24.0,,,187
top,,?,toyota,gas,std,four,sedan,fwd,front,,...,,mpfi,3.62,3.4,,68.0,5500.0,,,?
freq,,41,32,185,168,114,96,120,202,,...,,94,23.0,20.0,,19.0,37.0,,,4
mean,0.834146,,,,,,,,,98.756585,...,126.907317,,,,10.142537,,,25.219512,30.75122,
std,1.245307,,,,,,,,,6.021776,...,41.642693,,,,3.97204,,,6.542142,6.886443,
min,-2.0,,,,,,,,,86.6,...,61.0,,,,7.0,,,13.0,16.0,
25%,0.0,,,,,,,,,94.5,...,97.0,,,,8.6,,,19.0,25.0,
50%,1.0,,,,,,,,,97.0,...,120.0,,,,9.0,,,24.0,30.0,
75%,2.0,,,,,,,,,102.4,...,141.0,,,,9.4,,,30.0,34.0,
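
# The difference between describe() and describe(include="all") is easier to see on a tiny table.
# This sketch uses made-up values with one text column and one numeric column.
```python
import pandas as pd

sample = pd.DataFrame({
    "make": ["audi", "audi", "bmw", "toyota"],
    "city-mpg": [24, 18, 23, 30],
})  # made-up values

numeric_summary = sample.describe()             # numeric columns only
full_summary = sample.describe(include="all")   # adds the unique, top and freq rows

print(list(numeric_summary.columns))
print(full_summary.loc[["unique", "top", "freq"], "make"])
print(numeric_summary.loc["mean", "city-mpg"])
```
# In the combined summary, statistics that do not apply to a column (for example mean for "make") are NaN,
# which is why the output above contains so many empty cells.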
