# Dog Recommendation System
The system works by collecting data from users about their preferences and matching them with data about dogs, such as breed, size, and temperament. Users provide their preferences by answering a questionnaire that asks about their lifestyle, living situation, and preferences for a dog's size and temperament. The algorithm then matches the user's preferences with the characteristics of different dog breeds and recommends a few options that are likely to be a good fit for them.

## Importing Libraries 

### numpy (imported as np):
A popular numerical computation library that provides support for working with arrays, matrices, and mathematical functions.

### pandas (imported as pd): 
A powerful data manipulation library that provides data structures like DataFrames and Series, which are commonly used for data analysis tasks.

### seaborn (imported as sns):
A popular data visualization library based on matplotlib that provides additional functionality for creating attractive statistical plots.

### IPython.display:
A module that provides utilities for displaying rich output in IPython notebooks, such as the display() function, which can be used to show DataFrames or plots.

### sklearn.metrics.pairwise:
A module from the popular machine learning library scikit-learn that provides implementations of distance and similarity metrics commonly used in recommendation systems, such as euclidean_distances and cosine_similarity, which are used to compute distances and similarities between data points.


In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from IPython.display import display
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

### Loading the Dataset
The dataset was scraped from the American Kennel Club website; a version of it is publicly available on Kaggle. It contains 282 rows, one per dog breed, but has many missing values that will be dealt with later. The dataset has 18 columns for traits such as shedding, height, weight, and life expectancy.

The group column assigns each dog to one of 8 categories: toy, hound, terrier, working, sporting, non-sporting, herding, and miscellaneous.

A '_value' suffix in a column name indicates that the column holds one of the numeric values 0.2, 0.4, 0.6, 0.8, or 1, with a higher value indicating a stronger presence of the trait. For example, a value of 1 in shedding_value indicates high shedding.

In [2]:
# Load the dataset into a pandas dataframe
df = pd.read_csv('C:/Users/hp/OneDrive/Documents/SEM6/Flexi/akcdata-master/akcdata-master/data/akcdata.csv')

In [None]:

display(df.head())

## Data Cleaning and Preprocessing

In [None]:
# Converting the heights from inches to cm
df['min_height'] = df['min_height'] * 2.54
df['max_height'] = df['max_height'] * 2.54

# Converting the weights from lbs to kg
df['min_weight'] = df['min_weight'] * 0.453592
df['max_weight'] = df['max_weight'] * 0.453592

# Save the modified dataframe to a new csv file
df.to_csv('modified_dog_dataset.csv', index=False)

In [None]:
# Print the first few rows of the dataframe
display(df.head())

We now rename the first column of the DataFrame df from 'Unnamed: 0' to 'breed' using the rename() function from pandas. The columns parameter is used to specify the old column name as the key and the new column name as the value in a dictionary.

In [None]:
# Changing the name of the first column to 'breed'
df = df.rename(columns={'Unnamed: 0': 'breed'})

#### Removing Missing Values
Now we check whether there are any missing values (NaN or null) in the DataFrame df and print the count of missing values for each column.

Then we remove every row of df that contains a missing value using the dropna() function, modifying df in place.

In [None]:
# Check for missing values
print(df.isna().sum())

# Drop any rows with missing values
df.dropna(inplace=True)
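
The effect of these two calls can be seen on a small example. This is a minimal sketch on a hypothetical three-row DataFrame (the breed names and values are made up, not taken from the AKC data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; breed names and values are invented for illustration
toy = pd.DataFrame({
    'breed': ['A', 'B', 'C'],
    'min_height': [20.0, np.nan, 30.0],
    'popularity': [1, 2, np.nan],
})

missing_per_column = toy.isna().sum()  # number of NaN values in each column
cleaned = toy.dropna()                 # keeps only rows without any NaN

print(missing_per_column)
print(len(toy), 'rows before,', len(cleaned), 'rows after')
```

Note that dropna() removes a row if any single column is missing, which is why the row count can fall sharply when missing values are spread across many columns.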

## Exploratory Data Analysis
The pre-processing resulted in 187 rows of unique dog breeds which can further be analysed. 

We first print the data types using the dtypes attribute of a pandas DataFrame.

In [None]:

# Check the data types of the columns
print(df.dtypes)

### Histogram 
The histplot() function from the seaborn library is then used to create a histogram of the distribution of the target variable 'popularity'. The data parameter is set to df to specify the DataFrame to be used, and the x parameter is set to 'popularity' to specify the column to be plotted on the x-axis. The bins parameter is set to 20 to specify the number of bins in the histogram. This operation creates a histogram plot of the distribution of the 'popularity' values in the DataFrame df.

In [None]:
# Explore the distribution of the target variable
sns.histplot(data=df, x='popularity', bins=20)

### Pairplot 
The pairplot() function from the seaborn library is used to create a scatterplot matrix that shows the pairwise relationships between the target variable 'popularity' and other variables in the DataFrame df. 

The data parameter is set to df to specify the DataFrame to be used, and the vars parameter is set to a list of column names (['popularity', 'min_height', 'max_height', 'min_weight', 'max_weight']) to specify the variables to be plotted.

This operation creates a scatterplot matrix that visualizes the relationships between the 'popularity' and the other variables in df, allowing for exploration of potential correlations or patterns in the data.

In [None]:
# Explore the relationship between the target variable and other variables
sns.pairplot(data=df, vars=['popularity', 'min_height', 'max_height', 'min_weight', 'max_weight'])

## Feature Engineering

We perform operations on some '_value' columns to obtain new features, stored as columns of ones and zeroes (True/False flags), to be used in the model.

### Shedding Feature
We iterate over every column in df whose name contains 'value' and, for each one, create three new boolean columns with the prefixes 'high_', 'medium_', and 'low_', followed by the original column name with '_value' removed. For example, shedding_value yields high_shedding, medium_shedding, and low_shedding.

We apply lambda functions to the original column to flag each value as high, medium, or low. The conditions for each category are defined as follows:
- 'high_' prefix: values greater than or equal to 0.8
- 'medium_' prefix: values between 0.4 and 0.8 (inclusive)
- 'low_' prefix: values less than or equal to 0.4

Because the boundaries are inclusive, a value of 0.4 is flagged as both low and medium, and a value of 0.8 as both medium and high.

Next we replace each value in the original column with a two-element list [original value, 0]. This list form is what later allows temp[col].tolist() to be passed to euclidean_distances as a two-dimensional array of points.

Lastly, we select the columns 'shedding_value', 'shedding_category', 'high_shedding', 'medium_shedding', and 'low_shedding' from df to display them side by side.

In [None]:
for col in [col for col in df.columns if 'value' in col]:
    # Boolean flags for each trait; 0.4 and 0.8 each satisfy two of the conditions
    df[('high_'+col).replace('_value','')] = df[col].apply(lambda x: x >= .8)
    df[('medium_'+col).replace('_value','')] = df[col].apply(lambda x: .4 <= x <= .8)
    df[('low_'+col).replace('_value','')] = df[col].apply(lambda x: x <= .4)
    
    # Store each score as a two-element list [score, 0]
    df[col] = df[col].apply(lambda x: [x,0])

# Display the shedding columns (outside the loop, so this runs once)
df[['shedding_value', 'shedding_category', 'high_shedding', 'medium_shedding', 'low_shedding']]
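
To make the thresholds concrete, here is a minimal sketch that applies the same three conditions to a hypothetical Series holding each of the five possible scores. It uses vectorized comparisons and `Series.between` (inclusive on both ends by default) in place of the lambdas; the data is made up for illustration.

```python
import pandas as pd

# One row per possible score; hypothetical data, not from the AKC dataset
scores = pd.Series([0.2, 0.4, 0.6, 0.8, 1.0], name='shedding_value')

flags = pd.DataFrame({
    'shedding_value': scores,
    'high_shedding': scores >= 0.8,
    'medium_shedding': scores.between(0.4, 0.8),  # inclusive on both ends
    'low_shedding': scores <= 0.4,
})
print(flags)
```

The printed table shows the overlap at the boundaries: 0.4 is both low and medium, and 0.8 is both medium and high.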

### Height, Weight, Expectancy related Features

#### Creating columns for average values of height, weight, expectancy
This code block calculates an average value for height, weight, and life expectancy by summing the 'max_' and 'min_' columns of each trait and dividing by 2.

The result is assigned to a new column named after the trait without the 'max_' or 'min_' prefix ('height', 'weight', and 'expectancy'). The original 'min_' and 'max_' columns are kept, and the new columns give a single representative value per breed for further analysis or modeling.


In [None]:
for col in ['height','weight','expectancy']:
    df[col] = (df['max_'+col] + df['min_'+col])/2

#### height, weight, expectancy columns for relative position values based on percentile range

We calculate percentiles for the 'height', 'weight', and 'expectancy' columns in df using the describe() function.
We then create three boolean columns ('high_'+col, 'medium_'+col, 'low_'+col) indicating whether each value lies above the 67th percentile, between the 33rd and 67th percentiles, or below the 33rd percentile.

In [None]:
for col in ['height','weight','expectancy']:
    temp = df[col].describe(percentiles=[.2,.33,.4,.6,.67,.8])
    df['high_'+col] = df[col].apply(lambda x: x > temp['67%'])
    # inclusive bounds, so a value equal to a percentile boundary still gets a category
    df['medium_'+col] = df[col].apply(lambda x: temp['33%'] <= x <= temp['67%'])
    df['low_'+col] = df[col].apply(lambda x: x < temp['33%'])
    

We then create new columns with a '_value' suffix (height_value, weight_value, expectancy_value). Each value is assigned a score of 0.2, 0.4, 0.6, 0.8, or 1 according to the percentile range it falls in (below the 20th percentile, 20th to 40th, and so on), and the score is stored as a two-element list with the score as the first element and 0 as the second.

This puts height, weight, and expectancy on the same five-step scale as the existing '_value' columns.

In [None]:
# Percentile boundaries are recomputed here so that this cell can run on its own
for col in ['height','weight','expectancy']:
    temp = df[col].describe(percentiles=[.2,.4,.6,.8])
    # Scores are written as strings first so that already-scored values are skipped
    df[col+'_value'] = df[col].apply(lambda x: '1' if x >= temp['80%'] else x)
    df[col+'_value'] = df[col+'_value'].apply(lambda x: '.8' if ((type(x)!=str) and (x >= temp['60%']) and (x < temp['80%'])) else x)
    df[col+'_value'] = df[col+'_value'].apply(lambda x: '.6' if ((type(x)!=str) and (x >= temp['40%']) and (x < temp['60%'])) else x)
    df[col+'_value'] = df[col+'_value'].apply(lambda x: '.4' if ((type(x)!=str) and (x >= temp['20%']) and (x < temp['40%'])) else x)
    df[col+'_value'] = df[col+'_value'].apply(lambda x: '.2' if ((type(x)!=str) and (x < temp['20%'])) else x)

    # Store each score as a two-element list [score, 0]
    df[col+'_value'] = df[col+'_value'].apply(lambda x: [float(x),0])
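
The same percentile scoring can be written more compactly. The sketch below is an alternative formulation, not the notebook's method: it assumes a hypothetical Series of ten heights, takes the 20th, 40th, 60th, and 80th percentiles with `Series.quantile`, and looks up each value's bin with `np.searchsorted`.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm; values are made up for illustration
heights = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0])

# Bin edges at the 20th, 40th, 60th and 80th percentiles
edges = heights.quantile([0.2, 0.4, 0.6, 0.8]).to_numpy()

# side='right' places a value equal to an edge in the higher bin (x >= edge)
bin_index = np.searchsorted(edges, heights.to_numpy(), side='right')
scores = np.array([0.2, 0.4, 0.6, 0.8, 1.0])[bin_index]

# Same [score, 0] layout as the '_value' columns
value_lists = [[float(s), 0] for s in scores]
print(value_lists)
```

With five bins of equal percentile width, roughly a fifth of the breeds receive each score, regardless of how the raw values are distributed.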

We now display all the weight-related columns as a DataFrame. The columns are:

- 'min_weight': contains the minimum weight in kg for a particular breed.

- 'max_weight': contains the maximum weight in kg for a particular breed.

- 'weight': is calculated as the average of 'min_weight' and 'max_weight', representing the estimated weight for a particular breed.

- 'high_weight', 'medium_weight', 'low_weight': are binary columns indicating whether the estimated weight falls in the higher, medium, or lower range based on percentile values.

- 'weight_value': contains the score (0.2 to 1) assigned from the percentile range of the estimated weight, stored as a list with the score as the first element and 0 as the second element.

In [None]:
df[['min_weight', 'max_weight', 'weight', 'high_weight', 'medium_weight', 'low_weight', 'weight_value']]

We create a list of column names to be used as output columns. The list includes 'group' and 'temperament' as well as any column names from df that contain substrings 'min_', 'max_', or 'category'.

- 'group': represents the group or category of the dog breed.
- 'temperament': represents the temperament or behavior traits associated with the dog breed.
- [col for col in df.columns if any([substr in col for substr in ['min_', 'max_', 'category']])]: dynamically generates a list of column names from df that contain 'min_', 'max_', or 'category' as substrings. 

In [None]:
output_cols = ['group', 'temperament'] + [col for col in df.columns if any([substr in col for substr in ['min_', 'max_', 'category']])]
display(output_cols)

### Group Feature

We define a function recommend_popular_dogs() that takes optional arguments group, low, medium, and high as input. The function is designed to recommend popular dog breeds based on specified criteria.

The function first checks the types of the input arguments group, low, medium, and high, and converts them to lists if they are strings.

The function then creates a sorted copy of df based on the 'popularity' column and assigns it to the temporary variable 'temp'.

The function then filters 'temp' based on the criteria provided in the input arguments:
- If 'group' is provided, only rows whose 'group' value matches one of the values in the 'group' list are kept.
- If 'low', 'medium', or 'high' lists are provided, only rows where the corresponding boolean column ('low_', 'medium_', or 'high_' followed by the trait name) is True are kept.

The function then prints the names of the recommended dog breeds, limited to 10 breeds or the number of breeds left in 'temp', whichever is smaller.
For each recommended breed, the function prints the breed name, the description, and the values of the columns in 'output_cols'. The function ends with a bare return statement, so it returns None.

In [None]:
def recommend_popular_dogs(group=[],low=[],medium=[],high=[]):

    #converting arguments to lists if they are strings
    if type(group) == str:
        group = [group]
    if type(low) == str:
        low = [low]
    if type(medium) == str:
        medium = [medium]
    if type(high) == str:
        high = [high]
    
    #sorting temp according to popularity
    temp = df.sort_values('popularity')

    #filter for input values
    if len(group) > 0: #if group value is provided
        temp = temp[temp['group'].isin(group)]#show rows where group value matches the input 

    if len(low) > 0: #if 'low_' input is given
        for col in low:
            temp = temp[temp['low_'+col]]#show rows where low_col value matches input value
    if len(medium) > 0:
        for col in medium:
            temp = temp[temp['medium_'+col]]
    if len(high) > 0:
        for col in high:
            temp = temp[temp['high_'+col]]
    
    
    # limiting recommendations to top 10 dogs
    num_dogs = min(10,len(temp))
    

    #printing recommended dogs
    for i in range(num_dogs):
        print('{}.'.format(i+1),temp['breed'].iloc[i])
    
    for i in range(num_dogs):
        print()
        print('{}.'.format(i+1),temp['breed'].iloc[i])
        print(temp['description'].iloc[i])
        print(temp[output_cols].iloc[i])

    return
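
The filter-then-rank pattern used by this function can be sketched on a small example. The helper `top_breeds` below is hypothetical (it is not part of the notebook) and returns the breed names as a list instead of printing them, which makes the result easier to check; the data is made up.

```python
import pandas as pd

# Hypothetical toy data with one boolean trait flag
toy = pd.DataFrame({
    'breed': ['A', 'B', 'C', 'D'],
    'group': ['Toy', 'Toy', 'Hound', 'Toy'],
    'popularity': [3, 1, 2, 4],
    'low_shedding': [True, False, True, True],
})

def top_breeds(frame, group=None, low=(), limit=10):
    """Hypothetical helper: filter rows, rank by popularity, return breed names."""
    result = frame.sort_values('popularity')
    if group:
        result = result[result['group'].isin(group)]
    for trait in low:
        # each 'low_<trait>' column is boolean, so it can be used directly as a mask
        result = result[result['low_' + trait]]
    return result['breed'].head(limit).tolist()

print(top_breeds(toy, group=['Toy'], low=['shedding']))
```

Each filter narrows the already-sorted frame, so the surviving rows stay in popularity order.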

We modify the 'group' column in df by applying a lambda function to each element. The lambda function uses replace() to remove the substring 'Group' and strip() to remove the space left behind, so that a value such as 'Toy Group' becomes 'Toy'.

In [None]:
# remove 'Group' and the surrounding whitespace from each group name
df['group']=df['group'].apply(lambda x: x.replace('Group','').strip())

#calling the defined function
recommend_popular_dogs()

We call recommend_popular_dogs() with the 'group' parameter set to 'Toy'. This restricts the recommendations to dog breeds belonging to the 'Toy' group.

In [None]:
recommend_popular_dogs(group='Toy')

### One-Hot Encoding Feature Engineering

#### Temperament Feature

We split the temperament column on commas using a lambda function to get a list of strings for each breed, stored in the "temperament list" column. Then we collect the unique values across all of these lists in "temperament_no_repeats" using the set() function.


We now build an encoded vector for each breed using a lambda function. The lambda function iterates over "temperament_no_repeats" and checks whether each value is present in the breed's "temperament list". If a value is present, it is encoded as 1, otherwise 0. The resulting list of ones and zeroes is stored in a new column called "one-hot temperament". Since a breed can have several temperaments, a vector can contain more than one 1.

In [None]:
# Split the "temperament" column into a list of strings, stripping the space after each comma
df['temperament list'] = df['temperament'].apply(lambda x: [t.strip() for t in x.split(',')] if type(x) == str else [])

# Extract the unique temperament values
temperament = []
for i in df['temperament list']:
    temperament.extend(i)
# sorted() gives the encoded positions a fixed, reproducible order
temperament_no_repeats = sorted(set(temperament))

# Create a list of 0/1 flags per breed, one flag for each unique temperament value
df['one-hot temperament'] = df['temperament list'].apply(lambda x: [int(t in x) for t in temperament_no_repeats])
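
As an illustration of this encoding, here is a minimal sketch on a hypothetical three-row DataFrame (the temperament strings are made up). It strips whitespace after splitting and sorts the vocabulary so that the vector positions are reproducible; both choices are assumptions of this sketch.

```python
import pandas as pd

# Hypothetical temperament strings; the last row is missing
toy = pd.DataFrame({'temperament': ['Friendly, Alert', 'Alert, Loyal', None]})

# Split on commas and strip whitespace so ' Alert' and 'Alert' are the same label
toy['temperament list'] = toy['temperament'].apply(
    lambda x: [t.strip() for t in x.split(',')] if isinstance(x, str) else [])

# Sorted vocabulary of unique labels fixes the position of each flag
vocabulary = sorted({t for row in toy['temperament list'] for t in row})

# One 0/1 flag per vocabulary entry
toy['one-hot temperament'] = toy['temperament list'].apply(
    lambda row: [int(t in row) for t in vocabulary])

print(vocabulary)
print(toy['one-hot temperament'].tolist())
```

A breed with no temperament information ends up with an all-zero vector.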


#### Group Feature

We extract the unique values from the "group" column using the unique() function and store them in an array called "group_no_repeats".

Then we build a one-hot encoded vector for each breed using a lambda function. The lambda function iterates over "group_no_repeats" and compares each value with the breed's "group" value. A match is encoded as 1, anything else as 0. The result is stored in a new column called "one-hot group".

In [None]:
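group_no_repeats = df['group'].unique()
# compare with == rather than `in`: on strings, `in` is a substring test,
# so one group name contained in another would be flagged as a match
df['one-hot group'] = df['group'].apply(lambda x: [int(group == x) for group in group_no_repeats])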
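
One detail worth checking when encoding group names: Python's `in` operator on two strings is a substring test, so a name that is contained in another name matches both. The sketch below uses a hypothetical list of group names and a hypothetical `one_hot` helper to compare the two approaches.

```python
# Hypothetical group names chosen so that one is a substring of another
groups = ['Sporting', 'Non-Sporting', 'Toy']

def one_hot(value, categories):
    """Hypothetical helper: 1 where the category equals the value, else 0."""
    return [int(value == c) for c in categories]

# Substring test: 'Sporting' is contained in 'Non-Sporting', so two flags are set
substring_vector = [int(c in 'Non-Sporting') for c in groups]

# Equality test: exactly one flag is set
exact_vector = one_hot('Non-Sporting', groups)

print(substring_vector, exact_vector)
```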

The recommend_similar_dogs() function takes several arguments as input and converts them into lists if they are provided as strings. It then works as follows:

- Creates a temporary DataFrame temp by selecting the columns of df that are not in the ignore list, and filters temp based on the input arguments.
- Computes similarity scores between the dogs in temp from several columns, using Euclidean distances and cosine similarities, and stores the results in a NumPy array sims.
- Identifies the index of the input breed in temp and uses it to sort that breed's similarity scores in descending order.
- Selects the top 10 similar breeds based on the similarity scores and stores their indices in breed_indices.
- Prints the list of recommended similar breeds with their descriptions and selected columns from temp.

Note: The actual implementation may vary depending on the specific structure and content of the DataFrame df, as well as the input arguments provided to the function.

## Similarity Score 

In [None]:
def recommend_similar_dogs(breed,group=[],low=[],medium=[],high=[],ignore=[],important=[]):

    #converting to lists if input is string
    if type(group) == str:
        group = [group]
    if type(low) == str:
        low = [low]
    if type(medium) == str:
        medium = [medium]
    if type(high) == str:
        high = [high]
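    if type(ignore) == str:
        ignore = [ignore]
    if type(important) == str:
        important = [important]
    
    
    #creating a temporary dataframe without the ignored columns
    #(a list is used because recent pandas versions do not accept a set as an indexer)
    temp_cols = [col for col in df.columns if col not in ignore]
    temp = df[temp_cols]


    if len(group) > 0: #if 'group' inputs are given
        temp = temp[(temp['breed']==breed)|(temp['group'].isin(group))] #keep the selected breed and the rows whose group is in the input 'group'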
    if len(low) > 0:
        for col in low:
            temp = temp[(temp['breed']==breed)|(temp['low_'+col])]
    if len(medium) > 0:
        for col in medium:
            temp = temp[(temp['breed']==breed)|(temp['medium_'+col])]
    if len(high) > 0:
        for col in high:
            temp = temp[(temp['breed']==breed)|(temp['high_'+col])]
    temp = temp.reset_index(drop=True)
    

We start by initializing a square 2D array of zeroes with one row and one column per row of 'temp', so that entry (i, j) can hold the similarity between breed i and breed j.

We iterate over all columns of temp whose name contains the substring 'value'. For each column, the similarity is computed as 1 minus the Euclidean distance between the breeds' scores and added to the sims array. If the column is in the 'important' input list, its contribution is multiplied by 5; otherwise it is added with a weight of 1.

In [None]:
    # (continues the body of recommend_similar_dogs)
    sims = np.zeros([len(temp),len(temp)]) #square 2D array of zeroes, one row and one column per row of 'temp'


    for col in [col for col in temp.columns if 'value' in col]: #iterating over all columns with substring "value" in temp
        #accept both the full column name ('height_value') and the short trait name ('height')
        if col in important or col.replace('_value','') in important:
            sims += 5*(1-np.array(euclidean_distances(temp[col].tolist(),temp[col].tolist())))
        else:
            sims += (1-np.array(euclidean_distances(temp[col].tolist(),temp[col].tolist())))

We iterate over the one-hot temperament and group columns and add their cosine similarities to the scores, again with a weight of 5 for columns marked as important.

In [None]:
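    # (continues the body of recommend_similar_dogs)
    for col in ['one-hot temperament','one-hot group']:
        #accept both the full column name ('one-hot group') and the short name ('group')
        if col in important or col.replace('one-hot ','') in important:
            sims += 5*np.array(cosine_similarity(temp[col].tolist(),temp[col].tolist()))
        else:
            sims += np.array(cosine_similarity(temp[col].tolist(),temp[col].tolist()))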
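
To show what these two contributions look like numerically, here is a minimal sketch that reproduces the arithmetic with NumPy alone, so each step is visible. The three breeds, their scores, and their encoded vectors are hypothetical, and the weight of 5 is applied to the score column as if it had been marked important.

```python
import numpy as np

# Scores for three hypothetical breeds, each stored as [score, 0]
value_col = np.array([[0.2, 0.0], [0.8, 0.0], [1.0, 0.0]])

# Pairwise Euclidean distances, then similarity = 1 - distance
diff = value_col[:, None, :] - value_col[None, :, :]
euclid = np.sqrt((diff ** 2).sum(axis=-1))
value_sim = 1 - euclid

# Hypothetical 0/1 encoded vectors (no all-zero rows, which would divide by zero here)
one_hot = np.array([[1, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
norms = np.linalg.norm(one_hot, axis=1, keepdims=True)
cosine = (one_hot @ one_hot.T) / (norms * norms.T)

# Weighted sum: the score column counts five times, the encoded column once
sims = 5 * value_sim + cosine
print(np.round(sims, 3))
```

Because the second list element is always 0, the Euclidean distance between two [score, 0] entries is just the absolute difference of the scores, so each '_value' column contributes between 0.2 and 1 per pair before weighting.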
    

We now find the index of the given breed in the DataFrame 'temp' and turn that breed's row of 'sims' into a list of (index, similarity score) pairs.
We sort this list with a lambda function as the key so that the pairs appear in decreasing order of similarity score.

In [None]:
    # (continues the body of recommend_similar_dogs)
    idx = temp[temp['breed']==breed].index #row position of the selected breed in temp
    sims = list(enumerate(sims[idx][0])) #(row position, similarity score) pairs for the selected breed
    
    sims = sorted(sims, key=lambda x: x[1], reverse=True) #descending order sorting of sims

    num_dogs = min(10,len(temp)) #the smaller of 10 and the number of rows in temp
    sims = sims[:num_dogs+1] #the selected breed (expected to rank first) plus up to 10 recommendations

    breed_indices = [i[0] for i in sims] #row positions of the breeds to print
    
    n = 0
    for i in breed_indices:
        if n == 0:
            print('Selected:',temp['breed'].iloc[i])
        else:
            print('{}.'.format(n),temp['breed'].iloc[i])
        n += 1
    
    n = 0
    for i in breed_indices:
        print()
        if n == 0:
            print('Selected:',temp['breed'].iloc[i])
        else:
            print('{}.'.format(n),temp['breed'].iloc[i])
        print(temp['description'].iloc[i])
        print(temp[output_cols].iloc[i])
        n += 1
    return
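
The ranking step can also be expressed with `np.argsort`. The sketch below is an alternative to the enumerate-and-sort approach, shown on a hypothetical row of similarity scores; it excludes the selected breed explicitly instead of relying on it ranking first.

```python
import numpy as np

# Hypothetical breeds and the selected breed's row of similarity scores
breeds = ['A', 'B', 'C', 'D']
row = np.array([6.0, 2.7, 1.0, 4.1])
selected = 0  # position of the selected breed

# Negating the scores makes argsort return positions in descending score order
order = np.argsort(-row)

# Drop the selected breed itself and keep the top two recommendations
recommended = [breeds[i] for i in order if i != selected][:2]
print(recommended)
```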

Testing the function with given input lists

In [None]:
recommend_similar_dogs('Shiba Inu', high=['demeanor' , 'trainability'],important=['group','height','weight'])