# IS 470 Lab 2: Data Exploration

---



## Part 1: Pokemon Data
<br>
This dataset contains information on 800 Pokemon from six generations.<br>
<br>
VARIABLE DESCRIPTIONS:<br>
number: The entry number of the Pokemon<br>
name: The English name of the Pokemon<br>
type1: The Primary Type of the Pokemon<br>
type2: The Secondary Type of the Pokemon<br>
hp: The Base HP of the Pokemon<br>
attack: The Base Attack of the Pokemon<br>
defense: The Base Defense of the Pokemon<br>
sp.atk: The Base Special Attack of the Pokemon<br>
sp.def: The Base Special Defense of the Pokemon<br>
speed: The Base Speed of the Pokemon<br>
generation: The numbered generation in which the Pokemon was first introduced<br>
legendary: Denotes if the Pokemon is legendary.<br>
<br>

### 1. Upload and clean data

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [2]:
# Import libraries
import pandas as pd
import seaborn as sns

In [None]:
# Read data
pokemon = pd.read_csv('/content/drive/MyDrive/IS470_data/pokemon.csv')
pokemon

In [None]:
# Examine the number of rows and cols
pokemon.shape

In [None]:
# Show the head rows of a data frame
pokemon.head()

In [None]:
# Show the tail rows of a data frame
pokemon.tail()

In [None]:
# Examine missing values
pokemon.isnull().sum()

In [8]:
# Replace missing values with the string 'None'
pokemon = pokemon.fillna('None')

In [None]:
# Examine missing values again
pokemon.isnull().sum()
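
The `fillna('None')` call above applies to every column, so a missing numeric value would also be replaced by the string `'None'` and that column would stop being numeric. The following is a minimal sketch of restricting the fill to a single column. The `toy` frame and its values are made up for illustration and are not rows from pokemon.csv, and the choice of `Type 2` as the column with gaps is an assumption to confirm against the `isnull().sum()` output.

```python
import pandas as pd

# Toy frame with hypothetical values; not rows from pokemon.csv
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Type 2": ["Poison", None, "Flying"],
    "Attack": [49, 52, 64],
})

# Fill only the column that is expected to contain missing values
toy["Type 2"] = toy["Type 2"].fillna("None")

# Check that no missing values remain and that Attack is still numeric
missing_after = int(toy.isnull().sum().sum())
attack_is_numeric = pd.api.types.is_numeric_dtype(toy["Attack"])
print(missing_after, attack_is_numeric)
```

Filling one column at a time keeps the numeric columns usable for `describe()` and `corr()` later in the lab.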

In [None]:
# Remove the unique identifier (pokemon number) from further analysis
pokemon = pokemon.drop(['#'],axis=1)
pokemon

In [None]:
# Examine variable type
pokemon.dtypes

In [12]:
# Change categorical variables to "category"
pokemon['Generation'] = pokemon['Generation'].astype('category')
pokemon['Type 1'] = pokemon['Type 1'].astype('category')
pokemon['Type 2'] = pokemon['Type 2'].astype('category')
pokemon['Legendary'] = pokemon['Legendary'].astype('category')

In [None]:
# Display all numeric variables
pokemon.select_dtypes(include=['number'])

In [None]:
# Display all categorical variables
pokemon.select_dtypes(include=['category'])

### 2. Understanding a single variable: numeric variables

In [None]:
# Show the statistics of a numeric variable: Attack
pokemon['Attack'].describe()

In [None]:
# Show the statistics of two numeric variables: HP and Attack
pokemon[['HP','Attack']].describe()

In [None]:
# obtain the max value of a numeric variable: Attack
pokemon['Attack'].max()

In [None]:
# Find the pokemon with the highest attack value
pokemon[pokemon['Attack'] == pokemon['Attack'].max()]

In [None]:
# Obtain rows with one condition (find pokemon with attack value greater than 170)
pokemon[pokemon['Attack'] > 170]

In [None]:
# Obtain rows with multiple conditions (find pokemon with attack greater than 170 and Defense greater than 150)
pokemon[(pokemon['Attack'] > 170) & (pokemon['Defense'] > 150)]
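
The multi-condition filter above relies on element-wise `&` with each comparison wrapped in parentheses. The sketch below shows the same pattern on a small made-up frame (the names and numbers are hypothetical), next to an equivalent `DataFrame.query` call, so the two styles can be compared.

```python
import pandas as pd

# Toy frame with hypothetical values; not rows from pokemon.csv
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Attack": [180, 175, 90, 185],
    "Defense": [160, 100, 200, 155],
})

# Each comparison needs its own parentheses because & binds tighter than >
mask = (toy["Attack"] > 170) & (toy["Defense"] > 150)
strong = toy[mask]

# The same filter written as a query string
strong_q = toy.query("Attack > 170 and Defense > 150")

print(strong["Name"].tolist())
```

Use `|` in place of `&` for an "or" condition; the Python keywords `and` and `or` do not work on whole columns outside of `query`.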

In [None]:
# Obtain the variance, standard deviation, and range of a numeric variable: Attack
print("variance: ", pokemon['Attack'].var(), "standard deviation: ", pokemon['Attack'].std(), "range: ", pokemon['Attack'].min(), pokemon['Attack'].max())

In [None]:
# IQR of Attack variable
IQR = pokemon['Attack'].quantile(0.75) - pokemon['Attack'].quantile(0.25)
print("IQR:", IQR)
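
One common use of the IQR is the 1.5 × IQR rule for flagging outliers, which is also the default whisker rule in seaborn boxplots. The sketch below applies that rule to a small made-up series; the values are illustrative only, and the 1.5 multiplier is a convention rather than something specified by this lab.

```python
import pandas as pd

# Hypothetical values with one obviously extreme point
values = pd.Series([10, 12, 13, 15, 16, 18, 20, 95])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = values[(values < lower) | (values > upper)]
print(iqr, outliers.tolist())
```

Points flagged this way correspond to the individual markers drawn beyond the whiskers in a default boxplot.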

In [None]:
# Boxplot of a numeric variable: Attack
snsplot = sns.boxplot(x='Attack', data = pokemon)
snsplot.set_title("Boxplot of Attack in the pokemon data set")

In [None]:
# Boxplot of a numeric variable: Defense
snsplot = sns.boxplot(x='Defense', data = pokemon)
snsplot.set_title("Boxplot of Defense in the pokemon data set")

In [None]:
# Boxplot of Attack of the 1st generation pokemon
snsplot = sns.boxplot(x='Attack', data = pokemon[pokemon['Generation']==1])
snsplot.set_title("Boxplot of Attack of the 1st generation pokemon")

In [None]:
# Boxplot of Defense of the 1st generation pokemon
snsplot = sns.boxplot(x='Defense', data = pokemon[pokemon['Generation']==1])
snsplot.set_title("Boxplot of Defense of the 1st generation pokemon")
snsplot.set_xlim([0, 240])

In [None]:
# Histogram of a numeric variable: Attack
snsplot = sns.histplot(x='Attack', data = pokemon)
snsplot.set_title("Histogram of Attack in the pokemon data set")

In [None]:
# Histogram of a numeric variable: Defense
snsplot = sns.histplot(x='Defense', data = pokemon)
snsplot.set_title("Histogram of Defense in the pokemon data set")

### 3. Understanding a single variable: categorical variables

In [None]:
# Show the statistics of a categorical variable: Type 1
pokemon['Type 1'].describe()

In [None]:
# Show the counts of unique pokemon types
pokemon['Type 1'].value_counts()

In [None]:
# Show the proportion of unique pokemon types
pokemon['Type 1'].value_counts(normalize=True)

In [None]:
# Plot a categorical variable: Type 1
snsplot = sns.countplot(x='Type 1', data=pokemon)
snsplot.set_xticklabels(snsplot.get_xticklabels(), rotation=40, ha="right")
snsplot.set_title("countplot of Type 1 in the pokemon data set")

In [None]:
# Plot a categorical variable: Legendary
snsplot = sns.countplot(x='Legendary', data=pokemon)
snsplot.set_title("countplot of Legendary in the pokemon data set")

### 4. Understand relationships of multiple variables

In [None]:
# Scatter plot of two numeric variables: Attack and Defense
snsplot = sns.scatterplot(x='Attack', y= 'Defense', data=pokemon)
snsplot.set_title("Scatterplot of Attack and Defense")

In [None]:
# Generate correlation coefficients of two numeric variables in a 2x2 matrix: Attack and Defense
# corr() returns Pearson correlation coefficients, which lie between -1 and 1.
# 0 means no linear correlation; 1 or -1 means a perfect linear relationship.
# Positive values indicate a positive relationship; negative values indicate a negative one.
pokemon[['Attack','Defense']].corr()
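
To make the coefficient less of a black box, the sketch below computes the Pearson correlation by hand and compares it with the pandas result. The two columns `x` and `y` are made-up numbers, not Pokemon stats.

```python
import pandas as pd

# Hypothetical paired observations
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2, 4, 5, 4, 5]})

# Deviations of each value from its column mean
dx = toy["x"] - toy["x"].mean()
dy = toy["y"] - toy["y"].mean()

# Pearson r: sum of cross-products divided by the product of the root sums of squares
r_manual = (dx * dy).sum() / ((dx ** 2).sum() * (dy ** 2).sum()) ** 0.5

# pandas uses the Pearson method by default
r_pandas = toy["x"].corr(toy["y"])
print(round(r_manual, 4), round(r_pandas, 4))
```

The coefficient only measures linear association, so two variables with a strong curved relationship can still produce a value near 0.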

In [None]:
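# Generate the correlation matrix of all numeric variables
# Select the numeric columns first: calling corr() on the full data frame
# can raise an error in pandas 2.0 and later when non-numeric columns are present
pokemon.select_dtypes(include=['number']).corr()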

In [None]:
# Generate pairwise scatter plots of all numeric variables
sns.pairplot(data = pokemon)

In [None]:
# Examine relationships between numeric and categorical variables: boxplot groups values of a numeric variable based on the values of a categorical variable.
snsplot = sns.boxplot(x='Attack', y= 'Type 1', data = pokemon)
snsplot.set_title("Boxplot of Attack based on pokemon type")
snsplot.set_xlim([0, 200])

In [None]:
# Boxplot of Attack grouped by Type 1, non-Legendary pokemon only
snsplot = sns.boxplot(x='Attack', y= 'Type 1', data = pokemon[pokemon['Legendary']==False])
snsplot.set_title("Boxplot of Attack based on pokemon type for non-Legendary pokemon")
snsplot.set_xlim([0, 200])

In [None]:
# Boxplot of Attack grouped by Type 1, Legendary pokemon only
snsplot = sns.boxplot(x='Attack', y= 'Type 1', data = pokemon[pokemon['Legendary']==True])
snsplot.set_title("Boxplot of Attack based on pokemon type for Legendary pokemon")
snsplot.set_xlim([0, 200])

## Part 2: CarAuction Data
<br>
This dataset contains information on cars purchased at auction.<br>
<br>
VARIABLE DESCRIPTIONS:<br>
Auction: Auction provider at which the vehicle was purchased<br>
Color: Vehicle Color<br>
IsBadBuy: Identifies if the kicked vehicle was an avoidable purchase<br>
MMRCurrentAuctionAveragePrice: Acquisition price for this vehicle in average condition as of current day<br>
Size: The size category of the vehicle (Compact, SUV, etc.)<br>
TopThreeAmericanName: Identifies if the manufacturer is one of the top three American manufacturers<br>
VehBCost: Acquisition cost paid for the vehicle at time of purchase<br>
VehicleAge: The years elapsed since the vehicle's manufacture year<br>
VehOdo: The vehicle's odometer reading<br>
WarrantyCost: Warranty price (term = 36 months and mileage = 36K)<br>
WheelType: The vehicle wheel type description (Alloy, Covers)<br>

### 1. Upload and clean data

In [None]:
#Import packages
import pandas as pd
import seaborn as sns

In [None]:
# Read data
carAuction = pd.read_csv("/content/drive/MyDrive/IS470_data/carAuction.csv")
carAuction

In [None]:
# Examine the number of rows and cols


In [None]:
# Show the head rows of a data frame


In [None]:
# Examine missing values


In [None]:
# Examine variable type


In [None]:
# Change categorical variables to "category"


In [None]:
# Display all numeric variables


In [None]:
# Display all categorical variables


### 2. Understanding a single variable: numeric variables

In [None]:
# Show the statistics of VehOdo


In [None]:
# Obtain the variance, standard deviation, and range of WarrantyCost


In [None]:
# Display the IQR of WarrantyCost


In [None]:
# Boxplot of a numeric variable: VehBCost


In [None]:
# Boxplot of a numeric variable: VehicleAge


In [None]:
# Histogram of a numeric variable: VehOdo


### 3. Understanding a single variable: categorical variables

In [None]:
# Display the number of cars in different WheelType


In [None]:
# Display the proportion of cars in different WheelType


In [None]:
# Plot a categorical variable: WheelType


### 4. Understand relationships of multiple variables

In [None]:
# Scatter plot of two numeric variables: VehBCost and MMRCurrentAuctionAveragePrice


In [None]:
# Generate correlation coefficients of two numeric variables in a 2x2 matrix: VehBCost and MMRCurrentAuctionAveragePrice


In [None]:
# Generate the correlation matrix of all numeric variables


In [None]:
# Examine relationships between numeric and categorical variables: boxplot VehBCost based on IsBadBuy


Question: list one thing you learned from the carAuction data exploration.<br>
Double-click to enter your answer. 

In [None]:
!jupyter nbconvert --to html "/content/drive/MyDrive/IS470_lab/IS470_lab2.ipynb"