PetFinder

![Paw](https://storage.googleapis.com/kaggle-media/competitions/Petfinder/PetFinder%20-%20Logo.png)

![Cat](https://www.petfinder.my/images/cuteness_meter.jpg) 

A picture is worth a thousand words. But did you know a picture can save a thousand lives? Millions of stray animals suffer on the streets or are euthanized in shelters every day around the world. You might expect pets with attractive photos to generate more interest and be adopted faster. But what makes a good picture? With the help of data science, you may be able to accurately determine a pet photo’s appeal and even suggest improvements to give these rescue animals a higher chance of loving homes.

PetFinder.my is Malaysia’s leading animal welfare platform, featuring over 180,000 animals with 54,000 happily adopted. PetFinder collaborates closely with animal lovers, media, corporations, and global organizations to improve animal welfare.

Currently, PetFinder.my uses a basic Cuteness Meter to rank pet photos. It analyzes picture composition and other factors compared to the performance of thousands of pet profiles. While this basic tool is helpful, it's still in an experimental stage and the algorithm could be improved.

In this competition, you’ll analyze raw images and metadata to predict the “Pawpularity” of pet photos. You'll train and test your model on PetFinder.my's thousands of pet profiles. Winning versions will offer accurate recommendations that will improve animal welfare.

If successful, your solution will be adapted into AI tools that will guide shelters and rescuers around the world to improve the appeal of their pet profiles, automatically enhancing photo quality and recommending composition improvements. As a result, stray dogs and cats can find their "furever" homes much faster. With a little assistance from the Kaggle community, many precious lives could be saved and more happy families created.

Top participants may be invited to collaborate on implementing their solutions and creatively improve global animal welfare with their AI skills.

![pets](https://www.petfinder.my/images/cuteness_meter-showcase.jpg)

# <span id = "1"></span>Overview
<hr/>
Welcome to my Kernel! In this kernel I apply several machine learning algorithms to the competition data. I believe that doing so will help us better understand the mechanism and theory behind them.

<font color = 'green'><b>UPVOTE</b></font>:**It will be very much appreciated and will motivate me to offer more content to the** <font color='red'><b>kaggle</b></font>**community**😎 
<br/>
<img src="https://i.imgur.com/QPWu3Rd.png" title="source: Gradient Descent" height="400" width="800" />

# Read Data

In [None]:
import pandas as pd
import numpy as np

# /kaggle/input/petfinder-pawpularity-score/train.csv
# /kaggle/input/petfinder-pawpularity-score/test.csv
# /kaggle/input/petfinder-pawpularity-score/test/

train_df = pd.read_csv('/kaggle/input/petfinder-pawpularity-score/train.csv')
train_df

In [None]:
test_df = pd.read_csv('/kaggle/input/petfinder-pawpularity-score/test.csv')
test_df

In [None]:
from glob import glob
import random as r
import os

# Path variables
base_path = '/kaggle/input/petfinder-pawpularity-score/'
train_path = base_path + 'train.csv'
test_path = base_path + 'test.csv'

#In Python, the glob module is used to retrieve files/pathnames matching a specified pattern.
train_images = glob(base_path+'train/*.jpg')
test_images = glob(base_path+'test/*.jpg')

target = 'Pawpularity'
seed = 5682
r.seed(seed)
np.random.seed(seed)
#If PYTHONHASHSEED is set to an integer value, it is used as a fixed seed for generating the hash() of the types covered by the hash randomization.
os.environ['PYTHONHASHSEED'] = str(seed)

# Exclude the identifier and the target column so the label does not leak into the features
features = [c for c in train_df.columns if c not in ['Id', target]]
df = train_df.copy()

# Most popular Pet 

In [None]:
import matplotlib.pyplot as plt

mp = df[df[target] == df[target].max()].iloc[0]
path = f"{base_path}train/{mp['Id']}.jpg"
im = plt.imread(path)
plt.figure(figsize=(15,6))
plt.imshow(im)
plt.title('Most popular pet')

In [None]:
df[df['Id'] == path.split('/')[-1].split('.')[0]]

# Least popular pet

In [None]:
import matplotlib.pyplot as plt

mp = df[df[target] == df[target].min()].iloc[0]
path = f"{base_path}train/{mp['Id']}.jpg"
im = plt.imread(path)
plt.figure(figsize=(15,6))
plt.imshow(im)
plt.title('Least popular pet')

# Split the dataset

In [None]:
from sklearn.model_selection import train_test_split

x = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(x, y,test_size=0.3,random_state=seed)


In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error, r2_score

# Ref: https://www.kdnuggets.com/2020/11/simple-python-package-comparing-plotting-evaluating-regression-models.html
# Ref: https://machinelearningmastery.com/overfitting-machine-learning-models/

# Evaluating the models


### K-Fold Cross Validation method : A quick recap


K-Fold Cross Validation is a more sophisticated approach that generally gives a less biased estimate of model performance than a single train/test split. The method consists of the following steps:

1. Divide the n observations of the dataset into k mutually exclusive, equal or close-to-equal sized subsets known as “folds”.
2. Fit the model using k-1 folds as the training set and the remaining fold as the test set, then store the error of the model on that fold.
3. Repeat this process k times, using a different fold as the test set each time and the remaining k-1 folds as the training set.
4. Once all the iterations have finished, take the mean of the k stored errors. This is the cross-validated Mean Squared Error of the model.

The K-Fold cross-validation error has the following formula:

![Cv](https://financetrain.sgp1.cdn.digitaloceanspaces.com/error-model.png)
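In case the image does not load: assuming it shows the usual k-fold estimate, the formula can be written as follows. The notation here is my own, with $\mathrm{MSE}_i$ the mean squared error on the $i$-th held-out fold and $n_i$ the number of observations in that fold.

$$
\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{i=1}^{k}\mathrm{MSE}_i,
\qquad
\mathrm{MSE}_i = \frac{1}{n_i}\sum_{j \in \text{fold } i}\left(y_j - \hat{y}_j\right)^2
$$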

The following figure illustrates k-fold cross-validation with k=5. Other schemes for dividing the training set also exist.

![K Fold Cross validation](https://financetrain.sgp1.cdn.digitaloceanspaces.com/image16.png)
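To make the steps concrete, here is a minimal sketch of k-fold cross-validation written out by hand. It runs on synthetic data rather than the PetFinder metadata, and the names `X_demo`, `y_demo` and `fold_mse` are illustrative assumptions, not part of this notebook's pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: not the competition data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

k = 5
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_mse = []
for train_idx, test_idx in kf.split(X_demo):
    # Fit on k-1 folds, evaluate on the held-out fold
    model = LinearRegression()
    model.fit(X_demo[train_idx], y_demo[train_idx])
    preds = model.predict(X_demo[test_idx])
    fold_mse.append(mean_squared_error(y_demo[test_idx], preds))

# The CV estimate is the mean of the k per-fold errors
cv_mse = float(np.mean(fold_mse))
print("Per-fold MSE:", np.round(fold_mse, 4))
print(f"CV estimate (mean of {k} folds): {cv_mse:.4f}")
```

`cross_val_score`, used in the next cell, wraps this same loop and returns the per-fold scores.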

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
models = []
models.append(('LR',LinearRegression()))
models.append(('DR',DecisionTreeRegressor(random_state = seed,max_depth=3)))
models.append(('SVR',SVR(kernel='rbf', gamma='auto')))
models.append(('RFR',RandomForestRegressor(n_estimators = 300, random_state = 0)))
models.append(('Ridge',Ridge()))
# shuffle=True is required whenever random_state is set;
# recent scikit-learn versions raise a ValueError otherwise
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = []
names = []
scoring = 'r2'
for name, model in models:
    # Reuse the same splitter so every model is scored on identical folds
    cv_res = cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_res)
    names.append(name)
    msg = '%s : %f (%f)' % (name, cv_res.mean(), cv_res.std())
    print(msg)
    # One plot per model: R2 score on each of the 10 folds
    plt.plot(cv_res, label=name)
    plt.title('CV Results comparison')
    plt.xlabel('Fold')
    plt.ylabel('CV Result (R2)')

    plt.legend()
    plt.show()
    
# Ref: https://www.projectpro.io/recipes/compare-sklearn-classification-algorithms-in-python

# Observation

* Linear and Ridge regression models show stable CV results across folds.
* Support vector, random forest, and decision tree models show fluctuating performance across folds. K-fold helps detect overfitting; it does not remove it by itself.

> K-fold cross validation is a standard technique to detect overfitting. It cannot "cause" overfitting in the sense of causality. However, there is no guarantee that k-fold cross-validation removes overfitting.
> The data set is divided into k subsets, and the holdout method is repeated k times. Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set. Then the average error across all k trials is computed. The advantage of this method is that it matters less how the data gets divided. Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times. The variance of the resulting estimate is reduced as k is increased. The disadvantage of this method is that the training algorithm has to be rerun from scratch k times, which means it takes k times as much computation to make an evaluation.
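As a hedged illustration of how cross-validation can expose overfitting, the sketch below compares the training and held-out R2 of an unconstrained decision tree on noisy synthetic data. The data and the names `X_demo`, `y_demo`, `train_r2` and `test_r2` are assumptions made for this example only.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy data (assumption: not the competition data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=300)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    DecisionTreeRegressor(random_state=0),  # no depth limit, so it can memorize
    X_demo,
    y_demo,
    cv=cv,
    scoring="r2",
    return_train_score=True,
)

train_r2 = scores["train_score"].mean()
test_r2 = scores["test_score"].mean()
print(f"Mean train R2: {train_r2:.3f}")
print(f"Mean held-out R2: {test_r2:.3f}")
```

A large gap between the two numbers is the signal to look for. Cross-validation only reports the gap; reducing it takes a simpler model, regularization, or more data.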

In [None]:
# We also draw a box plot to visualize the spread of the CV results per model.
fig = plt.figure(figsize = (10,10))
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

In [None]:
import seaborn as sns
# One column per model, one row per CV fold
v = pd.DataFrame(dict(zip(names, results)))
ax = sns.swarmplot(data=v)
plt.xlabel('Model')
plt.ylabel('Results')
plt.title('Visualize model performance')
plt.show()

# https://seaborn.pydata.org/generated/seaborn.swarmplot.html


# Ridge regression model: building and evaluation

Below, we run cross-validation on the data set and collect the scores. This would be more appropriate with a separate train/test split used for prediction. Still, the R2 gives us some context for the next cell, where we look at how alpha changes the R2.

In [None]:
# Import the necessary module
from sklearn.model_selection import cross_val_score

# Create a linear regression object: reg
reg = LinearRegression()

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, x, y, cv=5)

# Print the 5-fold cross-validation scores
print(cv_scores)

# find the mean of our cv scores here
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))

Below we'll run a ridge regression and see how the score varies with alpha. This shows how the choice of alpha changes the R2.

In [None]:
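from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Create an array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Create a ridge regressor: ridge
# Ridge(normalize=True) was removed in scikit-learn 1.2. Scaling inside a
# pipeline is the usual replacement (similar to, but not identical with, the old behavior).
ridge = make_pipeline(StandardScaler(), Ridge())

# Compute scores over range of alphas
for alpha in alpha_space:

    # Specify the alpha value to use for the Ridge step of the pipeline
    ridge.set_params(ridge__alpha=alpha)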
    
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, x, y, cv=10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Use this function to create a plot    
def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    # cv_scores_std arrives as a Python list; convert it so the division is element-wise
    std_error = np.asarray(cv_scores_std) / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()

# Display the plot
display_plot(ridge_scores, ridge_scores_std)
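Once the mean scores are collected, the alpha with the highest mean CV score can be read off with `np.argmax`. The sketch below shows the idea on synthetic data with a smaller grid. The names `X_demo`, `y_demo`, `alpha_grid` and `best_alpha` are hypothetical and separate from the variables used above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the competition data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

# Smaller grid than in the notebook, only to keep the example quick
alpha_grid = np.logspace(-4, 0, 10)

# Mean 10-fold R2 for each alpha (R2 is the default score for regressors)
mean_scores = np.array([
    cross_val_score(Ridge(alpha=a), X_demo, y_demo, cv=10).mean()
    for a in alpha_grid
])

best_idx = int(np.argmax(mean_scores))
best_alpha = alpha_grid[best_idx]
print(f"Best alpha: {best_alpha:.4g} (mean R2 = {mean_scores[best_idx]:.4f})")
```

When the curve is nearly flat, as it often is with few features, the differences between alphas may be smaller than the fold-to-fold noise, so the selected value should be read with that in mind.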

# Submission

You can change the model and predict the popularity score

In [None]:

x = train_df.drop(columns='Id').iloc[:,:12]
y = train_df['Pawpularity']

# split data
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=210)

# clf = Ridge(alpha = 1)
reg = LinearRegression()
reg.fit(x_train, y_train)
y_pred = reg.predict(x_test)

# from sklearn.metrics import mean_squared_error
# mse=mean_squared_error(y_test,y_pred)
# rmse=np.sqrt(mse)
# rmse

In [None]:
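# Predict on the competition test set, not on the hold-out split of the training data
ids = test_df['Id'].to_list()
test_pred = reg.predict(test_df[x.columns])
sub_dict = {"Id": ids, "Pawpularity": list(test_pred)}
final_df = pd.DataFrame(sub_dict)
final_df.to_csv('./submission.csv', index=False)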

![Upvote](https://i.pinimg.com/originals/c6/ef/35/c6ef35fef7a2c4baf86a5a6732b2652d.gif)

If this helped you, your **UPVOTES** would be very much appreciated – as they are the source of motivation!

Happy learning

# References:

> * [Comparing Regression models](https://www.kaggle.com/ankitjha/comparing-regression-models)
> * [Regression algorithms using 'scikit-learn'](https://www.kaggle.com/amar09/regression-algorithms-using-scikit-learn)
