<a href="https://www.kaggle.com/code/ampiiere/animal-crossing-villager-popularity-analysis?scriptVersionId=100600201" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns  # data visualization
import matplotlib.pyplot as plt


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Introduction
The goal of this project is to analyse the relationship between the popularity of Animal Crossing: New Horizons villagers amongst the player base and certain villager attributes.

We will be analysing the Gender, Personality, Species, and Style of a villager.
The popularity data was scraped from [Animal Crossing Portal](https://www.animalcrossingportal.com/), a fan-made Animal Crossing community website, specifically its [Animal Crossing: New Horizons villager popularity list page](https://www.animalcrossingportal.com/games/new-horizons/guides/villager-popularity-list.php#/). The code for the scraper is on [GitHub](https://github.com/ampiiere/acnh-scrapper).

# Data Initialization and Cleaning

In [None]:
vlgr_df = pd.read_csv("/kaggle/input/animal-crossing-new-horizons-nookplaza-dataset/villagers.csv")
popul_df = pd.read_csv("/kaggle/input/acnh-villager-popularity/acnh_villager_data.csv")

In [None]:
vlgr_df.head()

In [None]:
popul_df.head()

### 1. Checking for null values

In [None]:
vlgr_df.info()

In [None]:
popul_df.info()

### 2. Checking for mismatched names

In [None]:
# Count how many names in vlgr_df also appear in popul_df; some names are missing or do not match
vlgr_df["Name"].isin(popul_df['name']).sum()

In [None]:
# vlgr_df does not have these names...
mismatch_names = popul_df.loc[~popul_df["name"].isin(vlgr_df["Name"]), "name"]
mismatch_names
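
The check above only looks in one direction. As a minimal sketch, assuming toy stand-in frames (`vlgr_toy` and `popul_toy` are hypothetical and hold only a few example names), an outer merge with `indicator=True` lists the names missing on either side in a single pass:

```python
import pandas as pd

# Hypothetical toy stand-ins for the two datasets (not the real data).
vlgr_toy = pd.DataFrame({"Name": ["O'Hare", "Renée", "Buck"]})
popul_toy = pd.DataFrame({"name": ["OHare", "Renee", "Buck"]})

# indicator=True adds a "_merge" column saying where each key was found.
check = vlgr_toy.merge(
    popul_toy, left_on="Name", right_on="name", how="outer", indicator=True
)

# Names that appear in only one of the two frames.
only_in_vlgr = sorted(check.loc[check["_merge"] == "left_only", "Name"])
only_in_popul = sorted(check.loc[check["_merge"] == "right_only", "name"])

print(only_in_vlgr)
print(only_in_popul)
```

Seeing both lists side by side makes it easier to pair up spelling variants by hand.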

In [None]:
# The dataset is small enough to pick out the matching names by hand
# Correcting names in popul_df to match vlgr_df
popul_df['name'] = popul_df['name'].replace({
    'OHare': "O'Hare",
    'Buck(Brows)': "Buck",
    'Renee': "Renée",
    'WartJr': "Wart Jr.",
    'Crackle(Spork)': "Spork",
})

In [None]:
# Checking that all names now match
vlgr_df["Name"].isin(popul_df['name']).sum()

In [None]:
# Drop villagers that are in popul_df but not in vlgr_df
popul_df = popul_df[popul_df["name"].isin(vlgr_df["Name"])]

### 3. Merging the two Dataframes

In [None]:
# Now that both DataFrames contain the same villagers, set the names as the index so the two can be merged
popul_df.set_index('name', drop=True, inplace=True)
vlgr_df.set_index('Name', drop=True, inplace=True)

In [None]:
combined_df = popul_df.merge(vlgr_df, left_index=True, right_index=True)
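
As an optional safeguard, the merge can be asked to verify that every name occurs once on each side. The sketch below uses hypothetical toy frames with placeholder values; `validate="one_to_one"` is a standard `DataFrame.merge` argument, but using it here is a suggestion rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frames indexed by villager name (placeholder values).
popul_toy = pd.DataFrame(
    {"tier": [1, 1, 2], "rank": [1, 2, 1]},
    index=["villager_a", "villager_b", "villager_c"],
)
vlgr_toy = pd.DataFrame(
    {"Species": ["Cat", "Deer", "Wolf"]},
    index=["villager_a", "villager_b", "villager_c"],
)

# validate="one_to_one" raises if a name is duplicated on either side.
merged_toy = popul_toy.merge(
    vlgr_toy, left_index=True, right_index=True, validate="one_to_one"
)

# A duplicated name makes the same call raise MergeError.
dup_toy = pd.concat([vlgr_toy, vlgr_toy.iloc[[0]]])
try:
    popul_toy.merge(dup_toy, left_index=True, right_index=True, validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True

print(merged_toy)
print(duplicate_detected)
```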

In [None]:
# drop irrelevant columns
combined_df.drop(columns=['Furniture List', 'Filename', 'Unique Entry ID', "Wallpaper", "Flooring", "Birthday", "Favorite Song"], inplace=True)

#### Adding a new column named overall_ranking so we know a villager's overall ranking across all tiers

In [None]:
combined_df.sort_values(['tier', 'rank'], inplace=True)
combined_df['overall_ranking'] = np.arange(1, len(combined_df)+1)
combined_df.insert(2, 'overall_ranking', combined_df.pop('overall_ranking'))

#### Setting Baseline overall ranking mean to compare against

In [None]:
overall_mean = combined_df.overall_ranking.mean()
print(f'The overall mean is {overall_mean}; this serves as a baseline to compare the popularity of each feature against.')
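
To illustrate the two steps above, here is a small sketch on a hypothetical three-row frame (`toy` and its values are made up). Because `overall_ranking` is simply 1 to n, its mean is always (n + 1) / 2, which is the baseline drawn as the red line in the plots.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: tier and rank-within-tier for three villagers.
toy = pd.DataFrame(
    {"tier": [2, 1, 1], "rank": [1, 2, 1]},
    index=["villager_c", "villager_b", "villager_a"],
)

# Sort by tier, then by rank inside the tier, and number the rows 1..n.
toy = toy.sort_values(["tier", "rank"])
toy["overall_ranking"] = np.arange(1, len(toy) + 1)

# The baseline is the mean of 1..n, i.e. (n + 1) / 2.
toy_mean = toy["overall_ranking"].mean()

print(toy)
print(toy_mean)
```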

In [None]:
combined_df.columns

# Exploratory Data Analysis
As a preface, a higher overall_ranking would mean performing worse on the popularity rankings.
### 1. Gender

In [None]:
combined_df['Gender'].value_counts()

In [None]:
combined_df.groupby('tier').Gender.value_counts().plot.barh()

For gender, the lowest tier (tier 6) contains disproportionately more male villagers than female villagers compared to the other tiers. Outside tier 6, the numbers of male and female villagers are fairly even, with male villagers holding a slight lead in every tier.
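
One way to put numbers on this observation is a row-normalised cross-tabulation, which gives the share of each gender within a tier. This is a sketch on hypothetical toy data, not the notebook's actual counts:

```python
import pandas as pd

# Hypothetical toy data: four villagers in tier 1 and four in tier 6.
toy = pd.DataFrame({
    "tier": [1, 1, 1, 1, 6, 6, 6, 6],
    "Gender": ["Male", "Female", "Male", "Female",
               "Male", "Male", "Male", "Female"],
})

# normalize="index" turns the counts in each tier into proportions.
share = pd.crosstab(toy["tier"], toy["Gender"], normalize="index")

print(share)
```

On the real data, the equivalent call would be `pd.crosstab(combined_df['tier'], combined_df['Gender'], normalize='index')`.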

In [None]:
plt.figure(figsize=(5, 5))
plt.axhline(overall_mean, color='r')
sns.boxplot(x="Gender", y='overall_ranking', data=combined_df)

Female villagers generally perform better than Male villagers in terms of overall ranking. 

In [None]:
pd.pivot_table(combined_df, index = 'tier', values = 'Catchphrase', columns="Gender", aggfunc='count')

### 2. Species

In [None]:
# mean overall ranking for each species, sorted from most to least popular
species_ranking = combined_df.groupby('Species')['overall_ranking'].mean().to_frame().reset_index().sort_values('overall_ranking')
species_ranking

In [None]:
plt.figure(figsize=(30,5))
sns.set(font_scale=1.4)
plt.xticks(rotation=45)
plt.axhline(overall_mean, color='r')
sns.scatterplot(x='Species', y="overall_ranking", data=species_ranking,label='mean overall-ranking', s=300)

Octopuses, Deer, Wolves, Cats and Koalas are the most likely to be popular, while Kangaroos, Hippos, Mice, Pigs and Gorillas are the least likely to be popular.

In [None]:
plt.figure(figsize=(30, 10))
plt.axhline(overall_mean, color='r')
sns.scatterplot(x="Species", y='overall_ranking', hue='tier', s=100, data=combined_df)

However, Octopuses seem to rank highly in part because there are so few of them amongst the villagers.
An interesting trend can be observed: there is a ranking cap for low-ranking species. For example, none of the Gorilla villagers has an overall_ranking below 200; the distribution is heavily skewed rather than normal. This indicates a clear non-preference for certain species by the player base.
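
Both points (small group sizes and a cap on the best ranking) can be checked numerically by reporting the group size and best ranking next to the mean. The following is a sketch on hypothetical toy values; the column names `count` and `best` are illustrative.

```python
import pandas as pd

# Hypothetical toy rankings for three species (not the real data).
toy = pd.DataFrame({
    "Species": ["Octopus", "Octopus", "Gorilla", "Gorilla", "Gorilla",
                "Cat", "Cat", "Cat", "Cat"],
    "overall_ranking": [5, 40, 210, 300, 390, 10, 100, 200, 350],
})

# Mean ranking, number of villagers, and best (lowest) ranking per species.
summary = (
    toy.groupby("Species")["overall_ranking"]
    .agg(mean="mean", count="count", best="min")
    .sort_values("mean")
)

print(summary)
```

A small `count` flags a mean that rests on very few villagers, and a large `best` shows that even the most popular member of the species ranks poorly.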

### 3. Personality

In [None]:
combined_df.Personality.value_counts()

In [None]:
# mean overall ranking for each personality type, sorted from most to least popular
personality_ranking = combined_df.groupby('Personality')['overall_ranking'].mean().to_frame().reset_index().sort_values('overall_ranking')

In [None]:
plt.figure(figsize=(20,5))
sns.set(font_scale=1.4)
plt.xticks(rotation=45)
plt.axhline(overall_mean, color='r')
sns.scatterplot(x='Personality', y="overall_ranking", data=personality_ranking,label='mean personality ranking', s=300)

The player base seems to prefer Big Sister, Normal, Peppy and sometimes Lazy villagers, while it dislikes Cranky, Jock and Snooty villagers.

In [None]:
plt.figure(figsize=(10, 10))
plt.axhline(overall_mean, color='r')
sns.boxplot(x="Personality", y='overall_ranking', data=combined_df)

There seems to be a clear preference for Big Sister, Peppy and Normal villagers, whose means lie below the overall mean. Rankings are fairly normally distributed except for Smug villagers. On the other hand, Cranky and Snooty villagers both have a mean clearly above the overall mean.
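
The comparison against the baseline can also be expressed as a signed difference per group. In this sketch on hypothetical toy values, a negative `delta` means the personality ranks better than the overall mean:

```python
import pandas as pd

# Hypothetical toy rankings for three personality types.
toy = pd.DataFrame({
    "Personality": ["Peppy", "Peppy", "Cranky", "Cranky", "Normal", "Normal"],
    "overall_ranking": [1, 3, 5, 6, 2, 4],
})

# Baseline: the mean ranking over all villagers.
toy_overall_mean = toy["overall_ranking"].mean()

# Group mean minus baseline; negative values mean "more popular than average".
delta = toy.groupby("Personality")["overall_ranking"].mean() - toy_overall_mean

print(delta.sort_values())
```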

In [None]:
pd.pivot_table(combined_df, index = 'tier', values = 'Catchphrase', columns="Personality", aggfunc='count')

### 4. Style

In [None]:
# mean overall ranking for each style, computed separately for the two style columns
style_ranking1 = combined_df.groupby('Style 1')['overall_ranking'].mean().to_frame().reset_index().sort_values('overall_ranking')
style_ranking2 = combined_df.groupby('Style 2')['overall_ranking'].mean().to_frame().reset_index().sort_values('overall_ranking')

In [None]:
# combining the 2 style columns and finding a mean
style_ranking = style_ranking1.copy()
style_series = (style_ranking1['overall_ranking'] + style_ranking2['overall_ranking'])/2
style_ranking["overall_ranking"] = style_series

In [None]:
style_ranking
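
The cells above average the two per-column means with equal weight. An alternative, shown here only as a sketch on hypothetical toy data, is to pool both style columns into one long column with `melt` and take a single mean per style, so that each style is weighted by how often it occurs. This is a different technique from the notebook's and can give different numbers.

```python
import pandas as pd

# Hypothetical toy frame with two style columns per villager.
toy = pd.DataFrame({
    "Style 1": ["Cute", "Cool", "Cute", "Simple"],
    "Style 2": ["Simple", "Cute", "Cute", "Cool"],
    "overall_ranking": [1, 2, 3, 4],
})

# Stack "Style 1" and "Style 2" into a single "Style" column.
long = toy.melt(
    id_vars="overall_ranking",
    value_vars=["Style 1", "Style 2"],
    value_name="Style",
)

# One pooled mean per style; a villager with the same style in both
# columns contributes twice.
pooled = long.groupby("Style")["overall_ranking"].mean().sort_values()

print(pooled)
```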

In [None]:
plt.figure(figsize=(20,5))
sns.set(font_scale=1.4)
plt.xticks(rotation=45)
plt.axhline(overall_mean, color='r')
sns.scatterplot(x='Style 1', y="overall_ranking", data=style_ranking, s=300)

There is a very clear preference for Cute-styled villagers. Simple-styled villagers have a mean ranking just about equal to the overall mean, while villagers of the other styles have a mean slightly above the overall mean.

In [None]:
plt.figure(figsize=(7, 7))
plt.axhline(overall_mean, color='r')
sns.boxplot(x="Style 1", y='overall_ranking', data=combined_df)
plt.title('Style 1')
plt.figure(figsize=(7, 7))
plt.axhline(overall_mean, color='r')
sns.boxplot(x="Style 2", y='overall_ranking', data=combined_df)
plt.title('Style 2')

The clear preference is the Cute style, in both Style columns. In particular, in the Style 2 column Cute-styled villagers are more concentrated at the lower ranking numbers (that is, they are more popular). The other styles seem to have fairly normally distributed rankings, with the exception of Active-style villagers in Style 1, whose rankings are right-skewed, but whose mean ranking is significantly above the overall mean.

In [None]:
pd.pivot_table(combined_df, index = 'tier', values = 'Catchphrase', columns="Style 1", aggfunc='count')

In [None]:
pd.pivot_table(combined_df, index = 'tier', values = 'Catchphrase', columns="Style 2", aggfunc='count')

# Conclusion
We may conclude that the following attributes contribute to a villager's popularity:
- Gender: Although female villagers are in general more popular, this is likely due to the overwhelming presence of male villagers in the lowest tier. Outside the lowest tier, male villagers in general perform slightly better.
- Species: Octopus, Wolf, Deer and Cat villagers perform the best.
- Personality: Big Sister, Normal and Peppy villagers are in general the most popular.
- Style: Cute-style villagers are very clearly the most popular.