## CSIT 440 Data Preprocessing Example in Python


#### Recommended Online Book
[Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### About the data set

The file **Cars93.csv** contains the Cars93 data set from the MASS package for R. It has 93 rows and 27 columns.

Cars were selected at random from among 1993 passenger car models that were listed in both the Consumer Reports issue and the PACE Buying Guide. Pickup trucks and Sport/Utility vehicles were eliminated due to incomplete information in the Consumer Reports source. Duplicate models (e.g., Dodge Shadow and Plymouth Sundance) were listed at most once. 

Source: 
Lock, R. H. (1993) 1993 New Car Data. Journal of Statistics Education 1(1). https://doi.org/10.1080/10691898.1993.11910459

Semantics of the columns:

1. Manufacturer.

2. Model.

3. Type: a factor with levels "Small", "Sporty", "Compact", "Midsize", "Large" and "Van".

4. Min.Price: Minimum Price (in \$1,000): price for a basic version.

5. Price: Midrange Price (in \$1,000): average of Min.Price and Max.Price.

6. Max.Price: Maximum Price (in \$1,000): price for “a premium version”.

7. MPG.city: City MPG (miles per US gallon by EPA rating).

8. MPG.highway: Highway MPG.

9. AirBags: Air Bags standard. Factor: none, driver only, or driver & passenger.

10. DriveTrain: Drive train type: rear wheel, front wheel or 4WD; (factor).

11. Cylinders: Number of cylinders (missing for Mazda RX-7, which has a rotary engine).

12. EngineSize: Engine size (litres).

13. Horsepower: Horsepower (maximum).

14. RPM: RPM (revs per minute at maximum horsepower).

15. Rev.per.mile: Engine revolutions per mile (in highest gear).

16. Man.trans.avail: Is a manual transmission version available? (yes or no, Factor).

17. Fuel.tank.capacity: Fuel tank capacity (US gallons).

18. Passengers: Passenger capacity (persons).

19. Length: Length (inches).

20. Wheelbase: Wheelbase (inches).

21. Width: Width (inches).

22. Turn.circle: U-turn space (feet).

23. Rear.seat.room: Rear seat room (inches) (missing for 2-seater vehicles).

24. Luggage.room: Luggage capacity (cubic feet) (missing for vans).

25. Weight: Weight (pounds).

26. Origin: Of non-USA or USA company origins? (factor).

27. Make: Combination of Manufacturer and Model (character).

### 1. Data Loading
Load **Cars93.csv** as a dataframe named ***cars***.
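A minimal sketch of the loading step. To stay self-contained it reads a made-up three-row excerpt from an in-memory string; the manufacturer names, model names, and prices are illustrative only and are not taken from the real file. In the notebook, the call would presumably be `pd.read_csv("Cars93.csv")`, assuming the file sits in the working directory.

```python
import io
import pandas as pd

# Made-up excerpt standing in for Cars93.csv (illustrative values only).
sample_csv = io.StringIO(
    "Manufacturer,Model,Type,Min.Price,Max.Price\n"
    "MakerA,ModelX,Small,10.0,15.0\n"
    "MakerB,ModelY,Compact,12.5,20.0\n"
    "MakerC,ModelZ,Midsize,18.0,30.0\n"
)

# With the real file this would be: cars = pd.read_csv("Cars93.csv")
cars = pd.read_csv(sample_csv)

print(cars.shape)   # (number of rows, number of columns)
print(cars.head())  # first few records
```

Checking `cars.shape` right after loading is a quick way to confirm that the full file reports 93 rows and 27 columns.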

### 2. Descriptive Statistics
Find the mean, median, maximum, minimum and standard deviation of **Max.Price**.
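One possible approach, sketched on a made-up four-value column so it runs without the CSV file: `Series.agg` computes several statistics in one call. Note that pandas `std` returns the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Made-up stand-in for the cars dataframe (illustrative values only).
cars = pd.DataFrame({"Max.Price": [10.0, 20.0, 30.0, 40.0]})

# Compute all five statistics in a single call.
stats = cars["Max.Price"].agg(["mean", "median", "max", "min", "std"])
print(stats)
```

`cars["Max.Price"].describe()` gives a similar summary, although it reports quartiles rather than the median by name.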

### 3. Data Selecting
List all the records of compact cars. 

List all the records of compact cars made by Ford. 
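Both selections can be written with boolean masks. The sketch below uses a made-up three-row dataframe (the model names are placeholders). Each condition needs its own parentheses when combined with `&`.

```python
import pandas as pd

# Made-up stand-in for the cars dataframe (placeholder model names).
cars = pd.DataFrame({
    "Manufacturer": ["Ford", "Ford", "Honda"],
    "Model": ["A", "B", "C"],
    "Type": ["Compact", "Small", "Compact"],
})

# All compact cars.
compact = cars[cars["Type"] == "Compact"]

# Compact cars made by Ford: combine two masks element-wise with &.
compact_ford = cars[(cars["Type"] == "Compact") & (cars["Manufacturer"] == "Ford")]

print(compact)
print(compact_ford)
```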

### 4. Data Normalization

#### Min_Max normalization
Normalizing data removes its units of measurement, which makes values from different sources or features easier to compare. Rescaling values to the range between 0 and 1 is usually called min-max normalization, a form of feature scaling. One formula for this is:

$$x_{\text{new}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$

Normalize the range of feature **Length** between 0 and 1 so that the minimum has value 0 and maximum has value 1.
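The rescaling described above can be sketched directly with pandas arithmetic. The three lengths below are made up, and the column name `Length_norm` is a hypothetical choice for the result.

```python
import pandas as pd

# Made-up stand-in for the cars dataframe (illustrative lengths in inches).
cars = pd.DataFrame({"Length": [150.0, 175.0, 200.0]})

length = cars["Length"]

# Subtract the minimum, then divide by the range (max - min).
cars["Length_norm"] = (length - length.min()) / (length.max() - length.min())

print(cars)
```

The parentheses matter: the subtraction in the numerator and in the denominator must both happen before the division.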

### 5. Data Cleaning

List the rows in **cars** that have missing values in the column **Min.Price**.

Replace the missing values in **Min.Price** with the column's mean.

Replace the missing values in **Max.Price** with the column's median.
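A minimal sketch of these three steps on made-up data: `isna` locates the missing entries and `fillna` replaces them. `mean` and `median` skip missing values by default, so they can be computed on the column before it is filled.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the cars dataframe with one gap per column.
cars = pd.DataFrame({
    "Min.Price": [10.0, np.nan, 20.0, 30.0],
    "Max.Price": [15.0, 25.0, np.nan, 50.0],
})

# Rows where Min.Price is missing.
missing_min = cars[cars["Min.Price"].isna()]
print(missing_min)

# Fill Min.Price with its mean and Max.Price with its median.
cars["Min.Price"] = cars["Min.Price"].fillna(cars["Min.Price"].mean())
cars["Max.Price"] = cars["Max.Price"].fillna(cars["Max.Price"].median())

print(cars)
```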

### 6. Data Visualization
We will exercise data visualization using the well-known Iris data set, which lists measurements of petals and sepals of three iris species.

In [None]:
iris = sns.load_dataset("iris")
iris

Drop the **species** column and store the result in a dataframe named **features**.
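One way to do this is `DataFrame.drop` with the `columns` argument, which returns a new dataframe and leaves the original untouched. The sketch builds a small made-up frame that reuses the column names of the seaborn Iris data set, so it runs without downloading anything; the values are illustrative.

```python
import pandas as pd

# Made-up rows using the seaborn iris column names (illustrative values).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.4],
    "sepal_width": [3.5, 3.0, 3.2],
    "petal_length": [1.4, 1.4, 4.5],
    "petal_width": [0.2, 0.2, 1.5],
    "species": ["setosa", "setosa", "versicolor"],
})

# Keep only the numeric measurements.
features = iris.drop(columns="species")

print(features.columns.tolist())
```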

#### 6.1 Box Plot
Use a box plot to display the distribution of each measurement in the dataframe **features**.
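A possible sketch using the pandas plotting interface, which draws one box per numeric column. The dataframe below is a made-up stand-in for **features**; `sns.boxplot(data=features)` would be a comparable seaborn alternative.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the features dataframe (illustrative values).
features = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.4, 6.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.5, 4.9, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.5, 1.5, 2.5, 1.9],
})

# One box per column; the call returns the matplotlib Axes it drew on.
ax = features.plot.box(figsize=(6, 4))
ax.set_title("Distribution of each measurement")
```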

#### 6.2 Scatter Plot
Display the scatter matrix and histograms of all the features in the dataframe **features**. Can you find any pairs of features that are positively correlated?
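One option is `pandas.plotting.scatter_matrix`, which places pairwise scatter plots off the diagonal and a histogram of each feature on the diagonal. The dataframe is again a made-up stand-in for **features**.

```python
import pandas as pd
from pandas.plotting import scatter_matrix

# Made-up stand-in for the features dataframe (illustrative values).
features = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.4, 6.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.5, 4.9, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.5, 1.5, 2.5, 1.9],
})

# Returns a 2-D array of Axes, one per pair of columns.
axes = scatter_matrix(features, figsize=(6, 6), diagonal="hist")
```

A point cloud that rises from lower left to upper right in one of the off-diagonal panels suggests a positive correlation between that pair of features.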

### 7. Correlation between Features
Get the Pearson correlation between every pair of columns in **features**.
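`DataFrame.corr` computes the pairwise correlation matrix, with Pearson as its default method. The sketch uses made-up columns with hypothetical names `x`, `y`, and `z`, chosen so the result is easy to check: `y` is an exact multiple of `x`, and `z` decreases as `x` increases.

```python
import pandas as pd

# Made-up stand-in for the features dataframe.
features = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],   # y = 2 * x, perfectly positively correlated
    "z": [4.0, 3.0, 2.0, 1.0],   # z = 5 - x, perfectly negatively correlated
})

# Symmetric matrix with ones on the diagonal.
corr = features.corr(method="pearson")
print(corr)
```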

### 8. Dim Reduction: PCA

In [None]:
from sklearn.decomposition import PCA 

Fitting the model learns several quantities from the data, most importantly the "explained variance", which is the amount of variance explained by each of the selected components.

Get the transformed data with one feature (a single principal component).
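A minimal sketch of fitting and transforming with scikit-learn, assuming a made-up two-column array in place of the real features. The points lie exactly on a line, so a single component should capture essentially all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: the second column is exactly twice the first.
X = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])

# Keep one principal component.
pca = PCA(n_components=1)
pca.fit(X)

# Variance explained by the component, absolute and as a fraction of the total.
print(pca.explained_variance_)
print(pca.explained_variance_ratio_)

# Project the data onto the component: one column per kept component.
X_reduced = pca.transform(X)
print(X_reduced.shape)
```

`pca.fit_transform(X)` combines the fit and transform steps in one call.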

### 9. Binning and Histogram

#### 9.1 Equal-width Binning

The pandas `cut` function splits values into bins and accepts a number of arguments. Passing an integer as `bins` creates that many equal-width bins. The `value_counts` method then returns a frequency table of the bins.
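As a sketch on five made-up prices, `pd.cut` with `bins=3` divides the range from 10 to 30 into three intervals of equal width. pandas extends the lowest edge slightly so that the minimum value falls inside the first bin.

```python
import pandas as pd

# Made-up stand-in for the Min.Price column.
prices = pd.Series([10.0, 12.0, 20.0, 28.0, 30.0], name="Min.Price")

# Three equal-width bins spanning the range of the data.
binned = pd.cut(prices, bins=3)
print(binned)

# Frequency table, ordered by bin rather than by count.
freq = binned.value_counts().sort_index()
print(freq)
```

With equal-width bins the counts per bin can differ widely when the data are skewed.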

Combine the binning data with the original data set.

Sort the data by the values in **Min.Price**.
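One way to combine and sort, sketched on made-up data: assign the bins as a new column, then call `sort_values`. The column name `Min.Price.bin` is a hypothetical choice, and `pd.concat` along `axis=1` would be an alternative way to attach the bins.

```python
import pandas as pd

# Made-up stand-in for the cars dataframe (placeholder model names).
cars = pd.DataFrame({
    "Model": ["A", "B", "C", "D", "E"],
    "Min.Price": [28.0, 10.0, 20.0, 30.0, 12.0],
})

# Attach the equal-width bin of each row as a new column.
cars["Min.Price.bin"] = pd.cut(cars["Min.Price"], bins=3)

# Sort by price; sort_values returns a new dataframe.
cars_sorted = cars.sort_values(by="Min.Price")
print(cars_sorted)
```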

#### 9.2 Equal-depth Binning

The pandas documentation describes `qcut` as a “Quantile-based discretization function.” In other words, `qcut` tries to divide the data into bins that each contain roughly the same number of observations. It derives the bin edges from quantiles of the data, so the bins generally do not have equal numeric widths.
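The sketch below applies `pd.qcut` with `q=4` to eight made-up values that include one large outlier. Each quartile bin still receives two observations, even though the last bin is far wider than the others.

```python
import pandas as pd

# Made-up values with one large outlier.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Four quantile-based bins (quartiles).
binned = pd.qcut(values, q=4)
print(binned)

# Each bin holds the same number of observations.
freq = binned.value_counts().sort_index()
print(freq)
```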

#### 9.3 Histogram
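A histogram is the graphical counterpart of equal-width binning. As a minimal sketch on made-up prices, matplotlib's `hist` computes the bin counts and draws the bars in one step; it returns the counts, the bin edges, and the bar patches.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up stand-in for the Min.Price column.
prices = pd.Series([10.0, 12.0, 20.0, 28.0, 30.0], name="Min.Price")

fig, ax = plt.subplots(figsize=(6, 4))

# Three equal-width bins over the range of the data.
counts, edges, patches = ax.hist(prices, bins=3, edgecolor="black")
ax.set_xlabel("Min.Price (thousands of dollars)")
ax.set_ylabel("Number of cars")

print(counts)
print(edges)
```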