## A. Loading a Dataset from scikit-learn

### Step 1
Load the `iris` dataset from scikit-learn using `load_iris()`. Assign the feature data to `X` and the target values to `y`.

In [3]:
from sklearn.datasets import load_iris

iris = load_iris()

In [14]:
X = iris['data']
y = iris['target']

### Step 2
Print the dataset keys using `iris.keys()`.

In [4]:
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

### Step 3
Print the names of the categories in the target variable.

In [7]:
iris['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

### Step 4
Print the feature names in the Iris dataset

In [8]:
iris['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

### Step 5
Print the type of the Iris dataset.

In [15]:
print("the type of the iris dataset is: ",type(X))

the type of the iris dataset is:  <class 'numpy.ndarray'>


### Step 6
Print the shape of the Iris dataset.

In [16]:
X.shape

(150, 4)

### Step 7
Print the first five rows of the Iris dataset.

In [17]:
X[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

### Step 8
Print the type of the target variable of the Iris dataset.

In [18]:
print("the type of the target variable: ", type(y))

the type of the target variable:  <class 'numpy.ndarray'>


### Step 9
Print the shape of the target variable of the Iris dataset.

In [19]:
y.shape

(150,)

### Step 10
Print the entire target variable values of the Iris dataset.

In [21]:
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


### Step 11
Import NumPy with `import numpy as np` and use the `np.unique()` function to print the unique values of the target variable of the Iris dataset.

In [22]:
import numpy as np 

np.unique(y)

array([0, 1, 2])
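As a small extension that is not part of the original exercise, passing `return_counts=True` to `np.unique` also reports how many samples carry each label, which is a quick way to check class balance. The names `classes` and `counts` below are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

# Reload the targets so the snippet runs on its own
y = load_iris()['target']

# return_counts=True returns the sorted unique labels and their frequencies
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```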

### Step 12
Split the dataset into train and test sets using `train_test_split` from `sklearn.model_selection`.

In [23]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y)

### Step 13
Print the shape of train/test datasets and the train/test target variables.

In [24]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((112, 4), (38, 4), (112,), (38,))
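`train_test_split` shuffles the rows, so which samples land in each split changes from run to run, while the sizes stay at 112 and 38 because the default test fraction is 0.25. The sketch below goes slightly beyond the exercise: it fixes `random_state` so the split is repeatable and passes `stratify=y` so both splits keep similar class proportions. The seed value `0` is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris['data'], iris['target']

# random_state pins the shuffle; stratify=y preserves the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Calling the function again with the same `random_state` returns the same rows in the same order.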

### Step 14
Build a k-nearest-neighbors classifier with a single neighbor (`n_neighbors=1`) using `KNeighborsClassifier` from `sklearn.neighbors`, fit the model to your train dataset, and make a prediction for the data point `[5, 1.9, 1, 0.2]`. Print the predicted class as an integer and also as the corresponding string label.

In [26]:
from sklearn.neighbors import KNeighborsClassifier 

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

KNeighborsClassifier(n_neighbors=1)

In [31]:
predict = knn.predict([[5, 1.9, 1, 0.2]])
print(predict)

[0]


In [34]:
print(iris['target_names'][predict])


['setosa']
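To make the single-neighbor rule concrete, the sketch below finds the closest sample by hand with NumPy and compares the result with scikit-learn. It assumes the classifier's default Euclidean distance and, for determinism, fits on all 150 rows rather than on a random train split; `query`, `manual_label`, and `sk_label` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris['data'], iris['target']
query = np.array([5, 1.9, 1, 0.2])

# Euclidean distance from the query to every sample
distances = np.linalg.norm(X - query, axis=1)

# With one neighbor, the prediction is the label of the closest sample
nearest = np.argmin(distances)
manual_label = y[nearest]

# Same prediction through scikit-learn, fit on all rows
knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
sk_label = knn.predict([query])[0]

print(manual_label, sk_label, iris['target_names'][manual_label])
```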


## B. Loading a Dataset and Exploring It

### Step 1
Import the modules needed to explore the data

In [36]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


### Step 2
Import the `auto_mpg.csv` dataset using pandas' `read_csv` function. Print the first three samples from your dataset, the index range of the observations, and the column names.

In [37]:
auto_data = pd.read_csv('auto_mpg.csv')
auto_data.head(3)

|   | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin | car name |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | 1 | plymouth satellite |


In [43]:
print('Index range: ', auto_data.index)
print('Column names: ', auto_data.columns)

Index range:  RangeIndex(start=0, stop=398, step=1)
Column names:  Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model year', 'origin', 'car name'],
      dtype='object')


In [45]:
auto_data.shape

(398, 9)

### Step 3
Assign the `mpg` column to `y` as the output, and assign the remaining columns to `X` as the features. Print the shape of `X`.

In [44]:
y = auto_data['mpg']
X = auto_data.drop(['mpg'], axis=1)
X.shape

(398, 8)

### Step 4
Bonus: Check whether any column of the dataset contains missing values using `isnull().any()`.
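A minimal sketch of this check follows. Because `auto_mpg.csv` is not available here, it uses a stand-in frame (`auto_demo`, a hypothetical name) built from three columns of the rows shown by `head(3)` above, and it blanks one entry on purpose so that there is something to detect. On the real data the same calls would be made on `auto_data`.

```python
import numpy as np
import pandas as pd

# Stand-in for auto_data: values copied from the head(3) output
auto_demo = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0],
    'horsepower': [130.0, 165.0, 150.0],
    'weight': [3504, 3693, 3436],
})

# Artificially blank one entry for demonstration only
auto_demo.loc[1, 'horsepower'] = np.nan

missing_by_column = auto_demo.isnull().any()  # True where a column has any NaN
missing_counts = auto_demo.isnull().sum()     # number of NaNs per column
print(missing_by_column)
print(missing_counts)
```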

### Step 5
Check the data types of each feature. Which columns are continuous and which are categorical?
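One possible way to approach this step is sketched below on a stand-in frame (`auto_demo`, a hypothetical name) built from the rows shown by `head(3)`. It lists the dtypes, separates numeric from non-numeric columns, and counts distinct values, since an integer column with only a few distinct values (for example `cylinders` or `origin`) may be better treated as categorical. That last point is a rule of thumb, not something pandas decides for you.

```python
import pandas as pd

# Stand-in for auto_data: values copied from the head(3) output
auto_demo = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0],
    'cylinders': [8, 8, 8],
    'weight': [3504, 3693, 3436],
    'origin': [1, 1, 1],
    'car name': ['chevrolet chevelle malibu', 'buick skylark 320',
                 'plymouth satellite'],
})

print(auto_demo.dtypes)

# Split columns by whether pandas stores them as numbers
numeric_cols = auto_demo.select_dtypes(include='number').columns.tolist()
other_cols = auto_demo.select_dtypes(exclude='number').columns.tolist()

# Few distinct values in a numeric column hints at a categorical code
distinct_counts = auto_demo[numeric_cols].nunique()
print(numeric_cols, other_cols)
print(distinct_counts)
```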

### Step 6
Look at the unique elements of horsepower
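A short sketch of this step, again on a stand-in frame (`auto_demo`, a hypothetical name) holding only the three `horsepower` values visible in the `head(3)` output. `unique()` returns the distinct values in order of first appearance, so an unexpected entry such as a text placeholder in a column that should be numeric would show up in this list.

```python
import pandas as pd

# Stand-in for auto_data: values copied from the head(3) output
auto_demo = pd.DataFrame({'horsepower': [130.0, 165.0, 150.0]})

values = auto_demo['horsepower'].unique()     # distinct values, first-seen order
n_values = auto_demo['horsepower'].nunique()  # how many distinct values
print(values, n_values)
```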

### Step 7
Let's describe the data, since everything looks in order.
- See the statistical details of the dataset using the `describe` and `info` methods.

### Step 8
Let's specifically look at the description of the mpg feature
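The two description steps can be sketched as follows, using a stand-in frame (`auto_demo`, a hypothetical name) built from the rows shown by `head(3)`. By default `describe()` summarizes only the numeric columns, `info()` prints dtypes and non-null counts, and calling `describe()` on a single column gives the summary for `mpg` alone.

```python
import pandas as pd

# Stand-in for auto_data: values copied from the head(3) output
auto_demo = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0],
    'horsepower': [130.0, 165.0, 150.0],
    'weight': [3504, 3693, 3436],
    'car name': ['chevrolet chevelle malibu', 'buick skylark 320',
                 'plymouth satellite'],
})

# count, mean, std, min, quartiles, and max for each numeric column
summary = auto_demo.describe()
print(summary)

# dtypes, non-null counts, and memory usage (printed, not returned)
auto_demo.info()

# Summary statistics for the mpg column only
mpg_summary = auto_demo['mpg'].describe()
print(mpg_summary)
```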

### Step 9
Visualize the distribution of the features using the `hist` method with `bins=20`.
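A minimal sketch of the histogram call, on a stand-in frame (`auto_demo`, a hypothetical name) built from the rows shown by `head(3)`. With only three rows the plots are not informative; the point is the call itself, which draws one histogram per numeric column. The `Agg` backend line is an assumption made so the snippet runs without a display, and it should be dropped inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for auto_data: values copied from the head(3) output
auto_demo = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0],
    'displacement': [307.0, 350.0, 318.0],
    'horsepower': [130.0, 165.0, 150.0],
    'weight': [3504, 3693, 3436],
})

# One subplot per numeric column, each with 20 bins
axes = auto_demo.hist(bins=20, figsize=(8, 6))
plt.tight_layout()
```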

### Step 10
Bonus: Visualize the relationships between these data points.

- Create a function to scale your dataset using the formula $b=\frac{x-\min}{\max-\min}$.
- Using this function, scale `displacement`, `horsepower`, `acceleration`, `weight`, and `mpg`.
- Create a boxplot of `mpg` for different `origin` values before and after scaling.
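The three bullets above can be sketched as follows. The function name `min_max_scale` and the stand-in frame `auto_demo` are hypothetical, and the frame holds only the rows shown by `head(3)`, so every `origin` value is 1 and the boxplot has a single group. The pandas `boxplot` method is used here; seaborn's `boxplot` would be an alternative. The `Agg` backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd


def min_max_scale(column):
    """Return (x - min) / (max - min) for a numeric pandas Series."""
    col_min, col_max = column.min(), column.max()
    if col_max == col_min:
        raise ValueError('cannot scale a constant column')
    return (column - col_min) / (col_max - col_min)


# Stand-in for auto_data: values copied from the head(3) output
auto_demo = pd.DataFrame({
    'mpg': [18.0, 15.0, 18.0],
    'displacement': [307.0, 350.0, 318.0],
    'horsepower': [130.0, 165.0, 150.0],
    'weight': [3504, 3693, 3436],
    'acceleration': [12.0, 11.5, 11.0],
    'origin': [1, 1, 1],
})

# Scale the requested columns on a copy so the raw values are kept
to_scale = ['displacement', 'horsepower', 'acceleration', 'weight', 'mpg']
scaled = auto_demo.copy()
scaled[to_scale] = auto_demo[to_scale].apply(min_max_scale)

# mpg grouped by origin, before and after scaling
fig, (ax_before, ax_after) = plt.subplots(1, 2, figsize=(10, 4))
auto_demo.boxplot(column='mpg', by='origin', ax=ax_before)
scaled.boxplot(column='mpg', by='origin', ax=ax_after)
ax_before.set_title('mpg by origin (raw)')
ax_after.set_title('mpg by origin (scaled)')
fig.suptitle('')
```

Because min-max scaling is a linear rescaling, the boxes keep the same relative shape; only the axis range changes to [0, 1].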
