**My first blog**

This blog post walks through a basic data analysis in Python. In this notebook, we will use the Palmer Archipelago (Antarctica) penguin dataset collected by Dr. Kristen Gorman. I'll use this data to perform some basic data analysis and then build a machine learning model to predict the species of penguins.

First, I'll import the libraries that I will be using in this session.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

I'll use the function "pd.read_csv()" to load the dataset and the method ".sample()" to look at a random sample of its rows.

In [None]:
df = pd.read_csv("../input/palmer-archipelago-antarctica-penguin-data/penguins_size.csv")
df.sample(10)

Now let's play a little bit with our dataset.

Let's look at the metadata (information about the dataset).

In [None]:
df.info()

Then let's check whether our dataset has any missing values.

To do this, I'll use the methods ".isnull()" and ".sum()" to count how many missing values there are in each column.

In [None]:
df.isnull().sum()

After checking our dataset, we can see that 5 columns contain missing values (culmen_length_mm, culmen_depth_mm, flipper_length_mm, body_mass_g and sex).

Let's clean the missing values.

The columns culmen_length_mm, culmen_depth_mm, flipper_length_mm and body_mass_g are numeric variables.

Let's calculate the mean of each column and replace the missing values with it.

In [None]:
mean_culmen_length = round(df['culmen_length_mm'].mean(),1)
mean_culmen_depth = round(df['culmen_depth_mm'].mean(),1)
mean_flipper_length = round(df['flipper_length_mm'].mean(),1)
mean_body_mass = round(df['body_mass_g'].mean(),1)

print("mean_culmen_length : ", mean_culmen_length)
print("mean_culmen_depth :", mean_culmen_depth)
print("mean_flipper_length :",mean_flipper_length)
print("mean_body_mass :",mean_body_mass)

After calculating the mean for each column, I'll replace the missing values by using the method ".replace()".

For the first argument, you define what you want to replace. In this case, I want to replace missing values, so I pass np.nan.

For the second argument, you define what you want to replace the missing values with. In this case, I use the mean values that I calculated in the previous code cell.

The method returns a new column, so I assign the result back to the same column of df. Calling ".replace()" with inplace=True on a single selected column is not reliable in recent pandas versions, because the selected column may be a temporary copy and df itself may stay unchanged.

In [None]:
df['culmen_length_mm'] = df['culmen_length_mm'].replace(np.nan , mean_culmen_length)
df['culmen_depth_mm'] = df['culmen_depth_mm'].replace(np.nan , mean_culmen_depth)
df['flipper_length_mm'] = df['flipper_length_mm'].replace(np.nan , mean_flipper_length)
df['body_mass_g'] = df['body_mass_g'].replace(np.nan , mean_body_mass)

df.isnull().sum()
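
As a side note, the same mean imputation can be written more compactly with ".fillna()". The sketch below is only an illustration on a made-up toy DataFrame (the values and the name `toy` are my own assumptions, not the penguin data), and it fills several numeric columns in one step.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the penguin measurements (values are made up).
toy = pd.DataFrame({
    "flipper_length_mm": [180.0, np.nan, 200.0],
    "body_mass_g": [3000.0, 4000.0, np.nan],
})

numeric_cols = ["flipper_length_mm", "body_mass_g"]

# One mean per column, rounded to one decimal like in the cells above.
col_means = toy[numeric_cols].mean().round(1)

# Passing a Series to fillna fills each column with the value stored
# under the matching column name.
toy[numeric_cols] = toy[numeric_cols].fillna(col_means)

print(toy)
```

Because the result is assigned back to `toy`, there is no need for inplace=True here.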

The missing numeric values have now been replaced. The sex variable is a categorical variable, so it can't be filled with a mean.

Instead, I'll drop the rows that have a missing sex value. To drop missing values, I'll use the method ".dropna()".

With the subset argument, I tell the function which column to check for missing values.

With the axis argument, if you set axis to 0, the function will drop rows that contain missing values.

But if you set axis to 1, the function will drop columns that contain missing values.

In this case, I want to drop rows that contain missing values, so I'll set axis to 0.

In [None]:
df.dropna(subset=['sex'] , axis = 0 , inplace = True)

print(df.isnull().sum())
print("observation : ",len(df))
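
To make the difference between the two axis values easier to see, here is a small sketch on a made-up toy DataFrame (the names `toy`, `rows_with_sex` and `no_missing_cols` are only for illustration).

```python
import numpy as np
import pandas as pd

# Toy data: one missing value in "sex", one in "body_mass_g".
toy = pd.DataFrame({
    "sex": ["MALE", None, "FEMALE"],
    "body_mass_g": [3700.0, 3400.0, np.nan],
})

# Drop only the rows where "sex" is missing (axis=0 is the default).
# The missing body mass in the last row is left alone.
rows_with_sex = toy.dropna(subset=["sex"])

# axis=1 drops every column that contains at least one missing value.
# Here both columns contain one, so no columns remain.
no_missing_cols = toy.dropna(axis=1)

print(rows_with_sex)
print(no_missing_cols.shape)
```

Neither call uses inplace=True, so `toy` itself stays unchanged and can be compared with the results.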

Let's filter the dataset to get only the data that we want.

To filter data with pandas, we can use the method ".query()".

For example, if I want to get the data of male penguins, the code looks like the cell below.

In [None]:
df.query("sex == 'MALE'").sample(10)

If you have more than one criterion to filter your data, you can link your criteria with "and" or "or".

For example, if you want to get the data of penguins that are male and from Dream island, the code looks like the cell below.

In [None]:
df.query("sex == 'MALE' and island == 'Dream'").sample(10)
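
Here is a small sketch of two more things ".query()" can do: combining criteria with "or", and referring to a Python variable with "@". The toy rows and the variable `min_mass` are made up for illustration. The last lines show the equivalent boolean mask for the "and" case.

```python
import pandas as pd

# Toy rows that mimic the columns of the penguin dataset (values are made up).
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Gentoo"],
    "sex": ["MALE", "FEMALE", "MALE", "FEMALE"],
    "island": ["Torgersen", "Dream", "Dream", "Biscoe"],
    "body_mass_g": [3750, 3300, 3900, 4500],
})

# "or" keeps rows that satisfy at least one criterion.
male_or_dream = toy.query("sex == 'MALE' or island == 'Dream'")

# "@" lets the query string use a Python variable.
min_mass = 3800
heavy = toy.query("body_mass_g > @min_mass")

# The "and" query from above, and the same filter as a boolean mask.
male_and_dream = toy.query("sex == 'MALE' and island == 'Dream'")
mask_version = toy[(toy["sex"] == "MALE") & (toy["island"] == "Dream")]

print(male_or_dream)
print(heavy)
```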

Now let's bin our penguins into different groups, using the body mass variable as the criterion.

To do this, we need to generate four equally spaced numbers from the body mass variable and use them as the edges of the bins.

I'll bin the penguins into 3 groups: small, medium and large.

To generate equally spaced numbers, I'll use the "np.linspace()" function.

"np.linspace()" takes three arguments: the start of the range, the end of the range, and how many numbers you want to generate.

In this case, I want to bin penguins into 3 groups by body mass, so the range runs from the minimum body mass to the maximum body mass, and I ask for 4 numbers.

In [None]:
bin = np.linspace(df['body_mass_g'].min(),df['body_mass_g'].max(),4)
label_names = ['small' , 'medium' , 'large']
print(bin)

As a result of executing the code, we get four equally spaced numbers.

Here are my criteria for binning penguins. If a penguin's body mass is between 2700 and 3900, it goes into the small group.

If a penguin's body mass is between 3900 and 5100, it goes into the medium group. If a penguin's body mass is between 5100 and 6300, it goes into the large group.

To bin the penguins into groups, I'll use the "pd.cut()" function and create a new column named size.

For the first argument, you specify the variable you want to use to bin your data.

The second argument is the bin edges that you want to use.

The third argument, labels, is the names of the groups that you want to bin your data into.

In [None]:
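# include_lowest=True keeps the penguin whose body mass equals the lowest bin edge
df['size'] = pd.cut(df['body_mass_g'] , bin , labels = label_names , include_lowest = True)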
df.sample(10)
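
One detail of "pd.cut()" is worth checking on a small example. By default each bin excludes its left edge, so a value equal to the lowest edge does not fall into any bin and becomes missing unless include_lowest=True is passed. The sketch below uses four made-up body masses and the same edges as above (the names `edges`, `masses`, `default_bins` and `inclusive_bins` are only for illustration).

```python
import numpy as np
import pandas as pd

# Four equally spaced edges between 2700 and 6300.
edges = np.linspace(2700, 6300, 4)
labels = ["small", "medium", "large"]

# Made-up body masses, including the lowest and highest edge.
masses = pd.Series([2700, 3900, 4000, 6300])

# Default behaviour: bins are (2700, 3900], (3900, 5100], (5100, 6300],
# so 2700 itself is left out and becomes NaN.
default_bins = pd.cut(masses, edges, labels=labels)

# include_lowest=True makes the first bin include 2700.
inclusive_bins = pd.cut(masses, edges, labels=labels, include_lowest=True)

print(edges)
print(default_bins.tolist())
print(inclusive_bins.tolist())
```

Note also that a value sitting exactly on an inner edge, such as 3900, goes into the lower group.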

Now that we have binned our penguins into three groups, let's create a model to predict the species of penguins using the flipper length and body mass variables.

But first, let's make a visualization to see the relationship between flipper length and body mass.

In [None]:
sns.scatterplot(data = df , x = 'flipper_length_mm' , y = 'body_mass_g' , hue = 'species' , alpha = 0.5 , palette=['blue','green','orange'])
plt.title("Relationship : flipper length vs body mass")
plt.xlabel("flipper length")
plt.ylabel("body mass")

As the graph shows, flipper length and body mass appear to be positively correlated.

In this graph, I use different colors to display different species.

As a result, it seems that Gentoo is the largest species, while Chinstrap and Adelie are smaller and overlap with each other.

So let's calculate the correlation of these two variables by using the method ".corr()".

In [None]:
df[['flipper_length_mm' , 'body_mass_g']].corr()

The correlation between flipper length and body mass is 0.87, which indicates that the two variables are strongly positively correlated.
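
By default, ".corr()" computes the Pearson correlation coefficient. To show what that number means, here is a small sketch that computes it by hand on made-up values and compares it with pandas (the data and the names `r_manual` and `r_pandas` are my own, not the penguin data).

```python
import numpy as np
import pandas as pd

# Made-up flipper lengths (mm) and body masses (g).
x = pd.Series([180.0, 190.0, 200.0, 210.0, 220.0])
y = pd.Series([3200.0, 3600.0, 3900.0, 4600.0, 5000.0])

# Center both variables around their means.
x_c = x - x.mean()
y_c = y - y.mean()

# Pearson r: sum of products of deviations, divided by the square root
# of the product of the summed squared deviations.
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same quantity computed by pandas.
r_pandas = x.corr(y)

print(r_manual, r_pandas)
```

A value close to 1 means that the points lie close to an upward-sloping straight line.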

Now it's time to make a model for prediction.

We need to import our model, "KMeans", from the sklearn library. This is the model that we're going to build in this session. We also need to import the train_test_split function to split our data for training and testing.

Why do we need to split the data for training and testing?

In real life, if you use all of your data to train the model and then want to evaluate it with data that the model has never seen before, where do you get new data?

When working on real projects, I don't think you'll have much time to collect data again to evaluate your model, because of deadlines.

That's why we need to split the data into two partitions: one for training the model and one for testing or evaluating it.



In [None]:
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split 


I've now imported the model and the "train_test_split" function.

I'll set the number of clusters to 3, which is equal to the number of penguin species (Adelie, Chinstrap, Gentoo).

As I said earlier, I'll use the flipper length and body mass variables to build the prediction model in this session.

So I create the object 'x' (the variables used to predict) and 'y' (the species variable, which will be used later),

and then pass 'x' and 'y' to the 'train_test_split' function to split the data, with the test set equal to 20% of the whole data.

The code looks like the following cell.




In [None]:
model = KMeans(n_clusters = 3)

x = df[['flipper_length_mm' , 'body_mass_g']]
y = df['species']

x_train,x_test,y_train,y_test = train_test_split(x, y , test_size = 0.2)
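
Note that "train_test_split" shuffles the rows randomly, so each run gives a different split. If you want the same split every time, you can pass random_state, and stratify keeps the proportion of each species the same in both partitions. The sketch below shows both on made-up toy data (the values, the seed 42, and the names `toy` and `x_test_again` are assumptions for illustration, not part of the notebook above).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up toy data with two species, five rows each.
toy = pd.DataFrame({
    "flipper_length_mm": [181, 186, 195, 193, 190, 211, 230, 210, 218, 215],
    "body_mass_g": [3750, 3800, 3250, 3450, 3650, 4500, 5700, 4400, 5700, 5400],
    "species": ["Adelie"] * 5 + ["Gentoo"] * 5,
})
x = toy[["flipper_length_mm", "body_mass_g"]]
y = toy["species"]

# random_state fixes the shuffle; stratify=y keeps the species balance.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

# Splitting again with the same seed selects the same rows.
_, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)

print(y_test.tolist())
```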

After splitting the data into 2 partitions, I'll use the method ".fit()" to fit the model on the training features (x_train).

In [None]:
model.fit(x_train)

Then, I'll use the attribute ".labels_" to see which cluster the model assigned to each training row.

In [None]:
model.labels_

I'll create a new column named "species" and put y_train into it.

Then I'll create a new column named "predicted_species" and put the cluster labels from the model into it.

In [None]:
x_train['species'] = y_train
x_train['predicted_species'] = model.labels_

x_train

I'll create a new object named "center". This object contains the x and y coordinates of the three cluster centers found by the model.

To get these coordinates, I'll use the attribute ".cluster_centers_".

In [None]:
center = model.cluster_centers_
print(center)

I use the center coordinates and the cluster labels from the model to make a visualization.

Here is a scatter plot of flipper length against body mass. Each point is one penguin, colored by the cluster that the model assigned it to.

Compared to the graph I made before building the model, in my run the orange, blue and green points correspond to the Gentoo, Chinstrap and Adelie species respectively. K-means numbers its clusters arbitrarily, so the colors can differ from run to run.

The three centers are displayed as black diamonds in the graph.

Here is how the model predicts the species of penguins. After fitting, it produces three center points (the black diamonds), which I'll call group points.

The model calculates the distance between each data point and each group point.

Then it assigns each data point to the group point that is the closest.
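
The assignment step described above can be sketched in a few lines of NumPy. This is only an illustration with made-up points and centers (it is not how sklearn is implemented internally, and the names `points`, `centers` and `assigned` are my own).

```python
import numpy as np

# Made-up data points: (flipper length in mm, body mass in g).
points = np.array([
    [190.0, 3700.0],
    [215.0, 5000.0],
    [195.0, 3800.0],
])

# Made-up group points (cluster centers).
centers = np.array([
    [190.0, 3600.0],
    [217.0, 5100.0],
])

# Euclidean distance from every point to every center.
# The result has one row per point and one column per center.
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Each point joins the center with the smallest distance.
assigned = distances.argmin(axis=1)

print(distances.round(1))
print(assigned)
```

Because body mass is measured in grams and flipper length in millimeters, the distance is dominated by body mass. Scaling the variables first is a common way to balance them.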

In [None]:
plt.scatter(x = center[:,0] , y = center[:,1] , marker = 'D' , color = 'black')
sns.scatterplot(data=x_train , x = 'flipper_length_mm' , y = 'body_mass_g' , hue='predicted_species',alpha=0.5,palette=['blue','green','orange'])

In [None]:
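# K-means cluster IDs are arbitrary, so check this mapping against the plot for your run
x_train['predicted_species'] = x_train['predicted_species'].replace([0,1,2] , ['Adelie','Gentoo','Chinstrap'])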
x_train

In [None]:
pd.crosstab(x_train['species'] ,x_train['predicted_species'])
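
Since K-means numbers its clusters arbitrarily, a fixed mapping from 0, 1, 2 to species names may not match on another run. One possible alternative, sketched below on made-up toy data, is to name each cluster after the species that appears in it most often (the column name `cluster` and the variable names are assumptions for illustration).

```python
import pandas as pd

# Made-up true species and cluster IDs for six penguins.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "cluster": [2, 2, 0, 0, 1, 2],
})

# Count species per cluster, then take the most frequent species in each row.
counts = pd.crosstab(toy["cluster"], toy["species"])
mapping = counts.idxmax(axis=1).to_dict()

# Translate cluster IDs into species names and measure the agreement.
toy["predicted_species"] = toy["cluster"].map(mapping)
accuracy = (toy["species"] == toy["predicted_species"]).mean()

print(mapping)
print(accuracy)
```

With this approach, two clusters can end up with the same species name if one species dominates both, so it is still worth looking at the crosstab.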

In [None]:
result = x_train['species'] == x_train['predicted_species']
print(result)

In [None]:
result.mean()

In [None]:
predict = model.predict(x_test)
print(predict)

In [None]:
x_test['species'] = y_test
x_test['predicted_species'] = predict

x_test

In [None]:
sns.scatterplot(data=x_test , x = 'flipper_length_mm' , y = 'body_mass_g' , hue='predicted_species',alpha=0.7,palette=['blue','green','orange'])
plt.scatter(x=center[:,0] , y=center[:,1] , color = 'black' , marker = 'D')

In [None]:
# Use the same cluster-to-species mapping as for the training data
x_test['predicted_species'] = x_test['predicted_species'].replace([0,1,2],['Adelie','Gentoo','Chinstrap'])
x_test

In [None]:
pd.crosstab(x_test['species'] , x_test['predicted_species'])

In [None]:
result_test = x_test['species'] == x_test['predicted_species']
print(result_test)

In [None]:
result_test.mean()