# Pandas
- Open-source Python library that provides high-performance data manipulation and analysis tools built on its own data structures.
- The name pandas is derived from "panel data", an econometrics term for multidimensional data sets.

## Dataframes
- Two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (row indices and column names)
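
The properties listed above can be seen on a small, made-up table. This is only a sketch: the frame `toy`, its values, and the row labels are assumptions for illustration and are not taken from Toyota.csv.

```python
import pandas as pd

# Hypothetical three-row table; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "Price": [13500, 13750, 13950],              # integers
        "Age": [23.0, 23.0, 24.0],                   # floats
        "FuelType": ["Diesel", "Diesel", "Petrol"],  # strings
    },
    index=["a", "b", "c"],                           # row labels
)

# Size-mutable: a column can be added after construction.
toy["Automatic"] = [0, 0, 1]

print(toy.shape)          # (rows, columns)
print(list(toy.columns))  # column labels
print(list(toy.index))    # row labels
```

Each column keeps its own data type, which is what "potentially heterogeneous" refers to.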

### importing necessary libraries

In [2]:
import os
import pandas as pd # to work with dataframes
import numpy as np # to perform numeric computations

### importing data

In [3]:
cars_data = pd.read_csv('Toyota.csv')

In [4]:
print(cars_data.head())

   Unnamed: 0  Price   Age     KM FuelType  HP  MetColor  Automatic    CC  \
0           0  13500  23.0  46986   Diesel  90       1.0          0  2000   
1           1  13750  23.0  72937   Diesel  90       1.0          0  2000   
2           2  13950  24.0  41711   Diesel  90       NaN          0  2000   
3           3  14950  26.0  48000   Diesel  90       0.0          0  2000   
4           4  13750  30.0  38500   Diesel  90       0.0          0  2000   

   Doors  Weight  
0  three    1165  
1      3    1165  
2      3    1165  
3      3    1165  
4      3    1170  


In [5]:
cars_data = pd.read_csv('Toyota.csv',index_col=0)

In [6]:
print(cars_data.head())

   Price   Age     KM FuelType  HP  MetColor  Automatic    CC  Doors  Weight
0  13500  23.0  46986   Diesel  90       1.0          0  2000  three    1165
1  13750  23.0  72937   Diesel  90       1.0          0  2000      3    1165
2  13950  24.0  41711   Diesel  90       NaN          0  2000      3    1165
3  14950  26.0  48000   Diesel  90       0.0          0  2000      3    1165
4  13750  30.0  38500   Diesel  90       0.0          0  2000      3    1170


### creating copy of original data
- In Python there are two ways to copy an object:
   1. Shallow copy
   2. Deep copy

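#### shallow copy
- sample = cars_data.copy(deep=False)
- Creates a new DataFrame object that shares the data and index of the original instead of copying them.
- Traditionally, changes made to the data of a shallow copy are also reflected in the original; in newer pandas versions with Copy-on-Write enabled, a modification no longer propagates.
- sample = cars_data is not a copy at all: it only binds a second name to the same object, so every change made through one name is visible through the other.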

#### deep copy
- sample = cars_data.copy(deep=True)
- The data and index are copied into a new object that holds no reference to the original.
- Changes made to the copy are not reflected in the original object.

In [8]:
# creating a deep copy (deep=True is the default)
cars_data1 = cars_data.copy()
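
The difference between plain assignment and a deep copy can be checked on a small frame. This is a minimal sketch on made-up data; the names `original`, `alias` and `deep` are assumptions for illustration. The shallow-copy case is left out on purpose because its behaviour depends on the pandas version.

```python
import pandas as pd

original = pd.DataFrame({"Price": [13500, 13750], "Age": [23.0, 24.0]})

alias = original                  # second name for the same object
deep = original.copy(deep=True)   # independent data and index

deep.loc[0, "Price"] = 0          # edit only the deep copy

print(alias is original)          # same object
print(deep is original)           # different object
print(original.loc[0, "Price"])   # original value is untouched
```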

### Attributes of data

#### DataFrame.index : To get the index (row labels) of the dataframe.

In [9]:
print(cars_data1.index)

Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
       ...
       1426, 1427, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435],
      dtype='int64', length=1436)


#### DataFrame.columns : To get the column labels of the dataframe

In [10]:
print(cars_data1.columns)

Index(['Price', 'Age', 'KM', 'FuelType', 'HP', 'MetColor', 'Automatic', 'CC',
       'Doors', 'Weight'],
      dtype='object')


#### DataFrame.size : To get the total number of elements in the dataframe

In [11]:
cars_data1.size

14360

#### DataFrame.shape : To get the dimensionality of the dataframe

In [12]:
cars_data1.shape

(1436, 10)

In [13]:
# no. of rows = 1436
# no. of cols = 10

#### DataFrame.memory_usage() : To get the memory usage of each column in bytes

In [14]:
cars_data1.memory_usage()

Index        11488
Price        11488
Age          11488
KM           11488
FuelType     11488
HP           11488
MetColor     11488
Automatic    11488
CC           11488
Doors        11488
Weight       11488
dtype: int64
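
Every column above reports 11488 bytes, which is 1436 rows times 8 bytes. For object columns that figure only counts the references, not the strings they point to. The sketch below shows the `deep=True` option on a made-up frame; the frame and its values are assumptions, and the object dtype is set explicitly so the example does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "Price": pd.Series([13500, 13750, 13950], dtype="int64"),
        "FuelType": pd.Series(["Diesel", "Petrol", "CNG"], dtype=object),
    }
)

shallow = frame.memory_usage(index=False)          # bytes of the column buffers only
deep = frame.memory_usage(index=False, deep=True)  # also counts the Python strings

print(shallow["Price"])                        # 3 values * 8 bytes
print(deep["FuelType"] > shallow["FuelType"])  # strings add to the deep figure
```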

#### DataFrame.ndim : The number of axes/array dimensions

In [15]:
cars_data1.ndim

2

In [16]:
# A two-dimensional array stores data in rows and columns.
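
The attributes above are related: size is the product of the two entries of shape (14360 = 1436 * 10 for this data set), and ndim is the length of shape. A small sketch on a made-up frame, not the Toyota data:

```python
import pandas as pd

frame = pd.DataFrame({"Price": [13500, 13750, 13950], "Age": [23.0, 23.0, 24.0]})

rows, cols = frame.shape
print(frame.shape)                     # (rows, columns)
print(frame.size == rows * cols)       # total number of cells
print(frame.ndim == len(frame.shape))  # one entry in shape per axis
```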

### Indexing and selecting data

#### DataFrame.head(n) : returns first n rows from the dataframe

In [17]:
cars_data1.head(6)

   Price   Age     KM FuelType  HP  MetColor  Automatic    CC  Doors  Weight
0  13500  23.0  46986   Diesel  90       1.0          0  2000  three    1165
1  13750  23.0  72937   Diesel  90       1.0          0  2000      3    1165
2  13950  24.0  41711   Diesel  90       NaN          0  2000      3    1165
3  14950  26.0  48000   Diesel  90       0.0          0  2000      3    1165
4  13750  30.0  38500   Diesel  90       0.0          0  2000      3    1170
5  12950  32.0  61000   Diesel  90       0.0          0  2000      3    1170


In [18]:
# by default, head() returns the first 5 rows of the dataframe.

#### DataFrame.tail(n) : returns the last n rows from the dataframe

In [19]:
cars_data1.tail(6)

      Price   Age     KM  FuelType   HP  MetColor  Automatic    CC  Doors  Weight
1430   8450  80.0  23000    Petrol   86       0.0          0  1300      3    1015
1431   7500   NaN  20544    Petrol   86       1.0          0  1300      3    1025
1432  10845  72.0     ??    Petrol   86       0.0          0  1300      3    1015
1433   8500   NaN  17016    Petrol   86       0.0          0  1300      3    1015
1434   7250  70.0     ??       NaN   86       1.0          0  1300      3    1015
1435   6950  76.0      1    Petrol  110       0.0          0  1600      5    1114


#### To access a scalar value, the fastest way is to use the at and iat accessors.

In [20]:
# at provides label-based scalar lookups
cars_data1.at[0,'Age']

23.0

In [21]:
# iat provides integer-position-based scalar lookups
cars_data1.iat[0,1]

23.0
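
In the cells above the row labels happen to be the integers 0, 1, 2, ..., so at and iat look alike. The sketch below uses hypothetical string row labels (`car_a`, `car_b`, made up for illustration) to make the difference visible: at takes labels, iat takes positions.

```python
import pandas as pd

frame = pd.DataFrame(
    {"Price": [13500, 13750], "Age": [23.0, 24.0]},
    index=["car_a", "car_b"],   # hypothetical string labels
)

by_label = frame.at["car_b", "Age"]   # row label, column label
by_position = frame.iat[1, 1]         # row position, column position
print(by_label, by_position)
```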

#### To access a group of rows and columns by label(s) .loc[] can be used

In [22]:
cars_data1.loc[:,'FuelType']

0       Diesel
1       Diesel
2       Diesel
3       Diesel
4       Diesel
         ...  
1431    Petrol
1432    Petrol
1433    Petrol
1434       NaN
1435    Petrol
Name: FuelType, Length: 1436, dtype: object

In [26]:
cars_data1.loc[5:10,['Age','KM']]

     Age     KM
5   32.0  61000
6   27.0     ??
7   30.0  75889
8   27.0  19700
9   23.0  71138
10  25.0  31461
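
Note that the slice 5:10 above returned six rows: .loc[] slices by label and includes both end points. Its position-based counterpart .iloc[] follows the usual Python convention and excludes the end. A short sketch on a made-up frame (values are assumptions for illustration):

```python
import pandas as pd

frame = pd.DataFrame(
    {"Age": [32.0, 27.0, 30.0, 27.0], "KM": [61000, 40000, 75889, 19700]}
)

by_label = frame.loc[1:3, ["Age", "KM"]]   # labels 1, 2 and 3
by_position = frame.iloc[1:3, [0, 1]]      # positions 1 and 2 only

print(len(by_label), len(by_position))
```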


## Data types
- There are two main types of data:
   1. Numeric : integer and float
   2. Character : category and object

pandas stores numeric columns as int64 and float64 by default. 64 bits is equivalent to 8 bytes per value.

### Checking the data types of each column
- dtypes returns a series with the data type of each column
- DataFrame.dtypes

In [27]:
cars_data1.dtypes

Price          int64
Age          float64
KM            object
FuelType      object
HP            object
MetColor     float64
Automatic      int64
CC             int64
Doors         object
Weight         int64
dtype: object

### Count of unique data types
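- DataFrame.get_dtype_counts() used to return the counts of unique data types in the dataframe, but it was deprecated in pandas 0.25 and removed in pandas 1.0
- DataFrame.dtypes.value_counts() gives the same information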

In [30]:
cars_data1.dtypes.value_counts()

int64      4
object     4
float64    2
Name: count, dtype: int64
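
The same count can be turned into a plain dictionary, which is convenient for quick checks. A minimal sketch on a made-up frame; the dtypes are set explicitly so the result does not depend on the platform, and the name `summary` is an assumption for illustration.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "Price": pd.Series([13500, 13750], dtype="int64"),
        "CC": pd.Series([2000, 1300], dtype="int64"),
        "Age": pd.Series([23.0, 24.0], dtype="float64"),
    }
)

counts = frame.dtypes.value_counts()   # one entry per distinct dtype
summary = {str(dtype): int(n) for dtype, n in counts.items()}
print(summary)
```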

### Selecting data based on data types
- pandas.DataFrame.select_dtypes() : returns a dataframe containing the subset of columns selected by their dtypes

In [33]:
cars_data1.select_dtypes(exclude=[object,float])

      Price  Automatic    CC  Weight
0     13500          0  2000    1165
1     13750          0  2000    1165
2     13950          0  2000    1165
3     14950          0  2000    1165
4     13750          0  2000    1170
...     ...        ...   ...     ...
1431   7500          0  1300    1025
1432  10845          0  1300    1015
1433   8500          0  1300    1015
1434   7250          0  1300    1015
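
select_dtypes() also accepts an include argument, and the string "number" selects every numeric dtype at once. The sketch below uses a made-up frame (names and values are assumptions for illustration) and shows that the result is a dataframe.

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "Price": pd.Series([13500, 13750], dtype="int64"),
        "Age": pd.Series([23.0, 24.0], dtype="float64"),
        "FuelType": pd.Series(["Diesel", "Petrol"], dtype=object),
    }
)

numeric = frame.select_dtypes(include="number")      # int and float columns
non_numeric = frame.select_dtypes(exclude="number")  # everything else

print(list(numeric.columns))
print(list(non_numeric.columns))
```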


### Concise summary of dataframe
- DataFrame.info()

In [34]:
cars_data1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Price      1436 non-null   int64  
 1   Age        1336 non-null   float64
 2   KM         1436 non-null   object 
 3   FuelType   1336 non-null   object 
 4   HP         1436 non-null   object 
 5   MetColor   1286 non-null   float64
 6   Automatic  1436 non-null   int64  
 7   CC         1436 non-null   int64  
 8   Doors      1436 non-null   object 
 9   Weight     1436 non-null   int64  
dtypes: float64(2), int64(4), object(4)
memory usage: 155.7+ KB


### Unique elements of columns
- Syntax : numpy.unique(array)

In [39]:
np.unique(cars_data1['KM'])

array(['1', '10000', '100123', ..., '99865', '99971', '??'], dtype=object)

In [40]:
np.unique(cars_data1['HP'])

array(['107', '110', '116', '192', '69', '71', '72', '73', '86', '90',
       '97', '98', '????'], dtype=object)

In [41]:
np.unique(cars_data1['MetColor'])

array([ 0.,  1., nan])

In [42]:
np.unique(cars_data1['Automatic'])

array([0, 1], dtype=int64)

In [43]:
np.unique(cars_data1['Doors'])

array(['2', '3', '4', '5', 'five', 'four', 'three'], dtype=object)
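
numpy.unique() sorts its result, so it may fail on an object column that mixes strings with NaN (FuelType in this data set has missing values). As an alternative sketch, the pandas methods Series.unique() and Series.value_counts() handle that case; the series `fuel` below is made up for illustration.

```python
import numpy as np
import pandas as pd

fuel = pd.Series(["Diesel", "Petrol", np.nan, "Petrol"], dtype=object)

distinct = fuel.unique()                   # order of first appearance, NaN kept
counts = fuel.value_counts(dropna=False)   # frequency of each value, NaN included

print(distinct)
print(counts)
```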

## Importing data

In [1]:
import pandas as pd
cars_data = pd.read_csv('Toyota.csv', index_col=0, na_values=['??', '????'])
print(cars_data.head())

   Price   Age       KM FuelType    HP  MetColor  Automatic    CC  Doors  \
0  13500  23.0  46986.0   Diesel  90.0       1.0          0  2000  three   
1  13750  23.0  72937.0   Diesel  90.0       1.0          0  2000      3   
2  13950  24.0  41711.0   Diesel  90.0       NaN          0  2000      3   
3  14950  26.0  48000.0   Diesel  90.0       0.0          0  2000      3   
4  13750  30.0  38500.0   Diesel  90.0       0.0          0  2000      3   

   Weight  
0    1165  
1    1165  
2    1165  
3    1165  
4    1170  


In [2]:
# creating a deep copy
cars_data2 = cars_data.copy()

In [3]:
cars_data2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Price      1436 non-null   int64  
 1   Age        1336 non-null   float64
 2   KM         1421 non-null   float64
 3   FuelType   1336 non-null   object 
 4   HP         1430 non-null   float64
 5   MetColor   1286 non-null   float64
 6   Automatic  1436 non-null   int64  
 7   CC         1436 non-null   int64  
 8   Doors      1436 non-null   object 
 9   Weight     1436 non-null   int64  
dtypes: float64(4), int64(4), object(2)
memory usage: 123.4+ KB
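
With na_values, KM and HP are now read as float64 and their placeholder strings are counted as missing. The effect can be reproduced without the data file; the sketch below reads a made-up three-row CSV from memory (the text in `csv_text` is an assumption for illustration).

```python
import io
import pandas as pd

csv_text = "Price,KM,HP\n13500,46986,90\n13750,??,????\n13950,41711,110\n"

raw = pd.read_csv(io.StringIO(csv_text))                                # placeholders kept as text
clean = pd.read_csv(io.StringIO(csv_text), na_values=["??", "????"])    # placeholders become NaN

print(raw.dtypes)
print(clean.dtypes)
print(clean.isna().sum())
```

Without na_values the two columns are read as text because one cell cannot be parsed as a number; with it they become float64, since NaN is a floating-point value.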


### Converting the data types of variables
- Syntax : DataFrame.astype(dtype)

In [4]:
cars_data2['MetColor'] = cars_data2['MetColor'].astype('object')
cars_data2['Automatic'] = cars_data2['Automatic'].astype('object')

In [5]:
cars_data2['FuelType'].nbytes

11488

In [6]:
cars_data2['FuelType'].astype('category').nbytes

1460
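
The two numbers above fit a simple picture: 11488 is 1436 rows times 8 bytes per object reference, and 1460 is consistent with 1436 one-byte category codes plus 24 bytes for three category labels. The sketch below repeats the comparison on a made-up series; the values and the name `as_category` are assumptions for illustration.

```python
import pandas as pd

fuel = pd.Series(["Diesel", "Petrol", "CNG"] * 1000, dtype=object)
as_category = fuel.astype("category")

print(fuel.nbytes)          # one reference per row
print(as_category.nbytes)   # small integer codes plus the category labels
print(list(as_category.cat.categories))
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows.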

In [7]:
cars_data2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Price      1436 non-null   int64  
 1   Age        1336 non-null   float64
 2   KM         1421 non-null   float64
 3   FuelType   1336 non-null   object 
 4   HP         1430 non-null   float64
 5   MetColor   1286 non-null   object 
 6   Automatic  1436 non-null   object 
 7   CC         1436 non-null   int64  
 8   Doors      1436 non-null   object 
 9   Weight     1436 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 123.4+ KB


### Cleaning column 'Doors'

In [9]:
import numpy as np
np.unique(cars_data2['Doors'])

array(['2', '3', '4', '5', 'five', 'four', 'three'], dtype=object)

#### replace() is used to replace a value with the desired value

In [10]:
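# assign the result back instead of calling replace(..., inplace=True) on the selected
# column, which newer pandas versions warn about and may not apply to cars_data2
cars_data2['Doors'] = cars_data2['Doors'].replace({'three': 3, 'four': 4, 'five': 5})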

In [12]:
cars_data2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Price      1436 non-null   int64  
 1   Age        1336 non-null   float64
 2   KM         1421 non-null   float64
 3   FuelType   1336 non-null   object 
 4   HP         1430 non-null   float64
 5   MetColor   1286 non-null   object 
 6   Automatic  1436 non-null   object 
 7   CC         1436 non-null   int64  
 8   Doors      1436 non-null   object 
 9   Weight     1436 non-null   int64  
dtypes: float64(3), int64(3), object(4)
memory usage: 123.4+ KB


In [15]:
cars_data2['Doors']=cars_data2['Doors'].astype('int64')

In [16]:
cars_data2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1436 entries, 0 to 1435
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Price      1436 non-null   int64  
 1   Age        1336 non-null   float64
 2   KM         1421 non-null   float64
 3   FuelType   1336 non-null   object 
 4   HP         1430 non-null   float64
 5   MetColor   1286 non-null   object 
 6   Automatic  1436 non-null   object 
 7   CC         1436 non-null   int64  
 8   Doors      1436 non-null   int64  
 9   Weight     1436 non-null   int64  
dtypes: float64(3), int64(4), object(3)
memory usage: 123.4+ KB
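
The two cleaning steps, replacing the spelled-out numbers and converting the column to int64, can also be chained. A minimal sketch on a made-up series; `doors` and `word_to_digit` are hypothetical names, not part of the notebook.

```python
import pandas as pd

doors = pd.Series(["three", "3", "4", "five", "2"], dtype=object)

# Map each spelled-out number to its digit, then convert the whole column.
word_to_digit = {"three": "3", "four": "4", "five": "5"}
cleaned = doors.replace(word_to_digit).astype("int64")

print(cleaned.tolist())
print(cleaned.dtype)
```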


### To detect missing values
- DataFrame.isnull().sum()

In [17]:
cars_data2.isnull().sum()

Price          0
Age          100
KM            15
FuelType     100
HP             6
MetColor     150
Automatic      0
CC             0
Doors          0
Weight         0
dtype: int64
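
isnull() returns a dataframe of booleans, so the same idea gives counts per row (axis=1) or the share of missing values per column (mean instead of sum). A short sketch on a made-up frame; the values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    {
        "Price": [13500, 13750, 13950, 14950],
        "Age": [23.0, np.nan, 24.0, np.nan],
        "KM": [46986.0, 72937.0, np.nan, 48000.0],
    }
)

per_column = frame.isnull().sum()       # missing values in each column
per_row = frame.isnull().sum(axis=1)    # missing values in each row
share = frame.isnull().mean()           # fraction missing in each column

print(per_column)
print(per_row.tolist())
print(share)
```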