# Analysing Bumble Profiles using Python

## Dataset

The dataset is available from the [bumble.csv](bumble.csv) file. The following columns are in the dataset:
1. `age`: Age of the user.
2. `status`: Relationship status (e.g., single, married, seeing someone).
3. `gender`: Gender of the user (e.g., m, f).
4. `body_type`: Descriptions of physical appearance (e.g., athletic, curvy, thin).
5. `diet`: Dietary preferences (e.g., vegetarian, vegan, anything).
6. `drinks`: Drinking habits (e.g., socially, often).
7. `education`: Education level (e.g., college, masters).
8. `ethnicity`: Ethnicity of the user (e.g., white, asian, black).
9. `height`: Height of the user (in inches).
10. `income`: User-reported annual income.
11. `job`: Job sector of the user (e.g., sales, marketing, student).
12. `last_online`: Date and time when the user was last active.
13. `location`: City and state where the user resides.
14. `pets`: Pet preference of the user (e.g., likes cats, has dogs).
15. `religion`: Religion of the user.
16. `sign`: Star sign of the user.
17. `speaks`: Languages the user speaks.

## Part 1: Data Cleaning

### 1. Inspecting Missing Data

Missing data is a common issue in real-world datasets. On a platform like Bumble, missing user information might reflect gaps in the user profile setup process, incomplete data collection, or users intentionally leaving certain fields blank.

Now we need to assess the extent of missing data, understand its potential impact, and decide the most appropriate methods to address it.

First, we need to load the dataset from the CSV file into a pandas DataFrame.

Before writing any Python code, install the `pandas` and `numpy` libraries by running the following command in your terminal (Bash on Linux, PowerShell on Windows). Make sure that Python is on your PATH.

In [1]:
pip install pandas numpy

Note: you may need to restart the kernel to use updated packages.


Once `pandas` is installed, run the following Python snippet to load the CSV file into a pandas DataFrame.

In [3]:
import pandas as pd
import numpy as np

df = pd.read_csv("bumble.csv")
df.head()

Unnamed: 0,age,status,gender,body_type,diet,drinks,education,ethnicity,height,income,job,last_online,location,pets,religion,sign,speaks
0,22,single,m,a little extra,strictly anything,socially,working on college/university,"asian, white",75.0,-1,transportation,2012-06-28-20-30,"south san francisco, california",likes dogs and likes cats,agnosticism and very serious about it,gemini,english
1,35,single,m,average,mostly other,often,working on space camp,white,70.0,80000,hospitality / travel,2012-06-29-21-41,"oakland, california",likes dogs and likes cats,agnosticism but not too serious about it,cancer,"english (fluently), spanish (poorly), french (..."
2,38,available,m,thin,anything,socially,graduated from masters program,,68.0,-1,,2012-06-27-09-10,"san francisco, california",has cats,,pisces but it doesn&rsquo;t matter,"english, french, c++"
3,23,single,m,thin,vegetarian,socially,working on college/university,white,71.0,20000,student,2012-06-28-14-22,"berkeley, california",likes cats,,pisces,"english, german (poorly)"
4,29,single,m,athletic,,socially,graduated from college/university,"asian, black, other",66.0,-1,artistic / musical / writer,2012-06-27-21-26,"san francisco, california",likes dogs and likes cats,,aquarius,english


As part of the analysis, certain questions need to be answered. We will use Python to answer each one.

#### Questions:

##### 1. Which columns in the dataset have missing values, and what percentage of data is missing in each column?



To find out which columns have missing values, we chain the pandas `isnull()` and `any()` methods, as in the following snippet. It returns each column name alongside `True` or `False`: `True` means the column contains at least one missing value, and `False` means it contains none.

In [6]:
print(df.isnull().any())

age            False
status         False
gender         False
body_type       True
diet            True
drinks          True
education       True
ethnicity       True
height          True
income         False
job             True
last_online    False
location       False
pets            True
religion        True
sign            True
speaks          True
dtype: bool


To find the percentage of null values in each column, we use `isnull()` to flag each value as null or not, then `sum()` to count the nulls per column. To turn the count into a percentage, we divide it by the number of rows, which is available as `df.shape[0]`, and multiply by 100. Run the following Python snippet to find the percentage of nulls in each column.

In [14]:
print(round(df.isnull().sum()/df.shape[0]*100,2))

age             0.00
status          0.00
gender          0.00
body_type       8.83
diet           40.69
drinks          4.98
education      11.06
ethnicity       9.48
height          0.01
income          0.00
job            13.68
last_online     0.00
location        0.00
pets           33.23
religion       33.74
sign           18.44
speaks          0.08
dtype: float64
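As a side note, the same percentages can be obtained a little more compactly by taking the mean of the boolean mask, since the mean of `True`/`False` values is the fraction of `True`. The sketch below uses a small made-up DataFrame (`sample` is not the real dataset) so that it runs on its own; filtering and sorting the result is just one possible way to present it.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real dataset (illustration only).
sample = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": ["anything", np.nan, np.nan, "vegetarian"],
    "height": [75.0, 70.0, np.nan, 71.0],
})

# isnull() gives a True/False mask; the mean of a boolean column is the
# fraction of True values, i.e. the fraction of missing entries.
missing_pct = (sample.isnull().mean() * 100).round(2)

# Keep only the columns that actually have gaps, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

On the real data, replacing `sample` with `df` should give the same figures as the `sum()`/`shape[0]` version above.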


##### 2. Are there columns where more than 50% of the data is missing? Would you drop the columns where more than 50% of the values are missing? If yes, why?

To check whether any column has more than 50% null values, run the following Python snippet. It returns:
- `True` if more than 50% of the column's values are null
- `False` otherwise

In [15]:
print(df.isnull().sum()/df.shape[0]*100 > 50)

age            False
status         False
gender         False
body_type      False
diet           False
drinks         False
education      False
ethnicity      False
height         False
income         False
job            False
last_online    False
location       False
pets           False
religion       False
sign           False
speaks         False
dtype: bool


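None of the columns has more than 50% null values, so no column needs to be dropped on this basis. The column with the most missing data is `diet`, at 40.69%.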
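If a column did cross the threshold, the check could be turned into a list of column names to drop. The following is a minimal sketch on made-up data; the helper name `columns_above_threshold` and the `sample` frame are hypothetical and not part of the original analysis.

```python
import numpy as np
import pandas as pd


def columns_above_threshold(frame, threshold_pct=50.0):
    """Return the names of columns whose share of nulls exceeds threshold_pct."""
    missing_pct = frame.isnull().mean() * 100
    return missing_pct[missing_pct > threshold_pct].index.tolist()


# Made-up data for illustration only.
sample = pd.DataFrame({
    "age": [22, 35, 38, 23],
    "diet": [np.nan, np.nan, np.nan, "vegetarian"],             # 75% missing
    "pets": ["has cats", np.nan, "likes dogs", "likes cats"],   # 25% missing
})

to_drop = columns_above_threshold(sample)
print(to_drop)

# drop() returns a new frame; the original is left untouched.
trimmed = sample.drop(columns=to_drop)
print(trimmed.columns.tolist())
```

Applied to the Bumble data with the default 50% threshold, the helper would return an empty list, consistent with the output above.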

##### 3. How would you handle the missing numerical data (e.g., `height`, `income`)? Would you impute the missing data with the median or average value of `height` and `income` for the corresponding category, such as `gender`, `age` group, or `location`? If yes, why?

To decide whether to impute missing numerical data such as `height` and `income` with the mean or the median, we need to look at the distribution of each column.
- For `height`, we can group by `gender`, since typical heights differ between genders.

To check whether the mean or the median is more appropriate for `height`, we calculate its skewness grouped by `gender`, along with summary statistics:

In [5]:
height_skew_by_gender = df.groupby('gender')['height'].skew()
print(height_skew_by_gender)

height_desc_by_gender = df.groupby('gender')['height'].describe()
print(height_desc_by_gender)

gender
f   -0.696659
m   -1.081008
Name: height, dtype: float64
          count       mean       std  min   25%   50%   75%   max
gender                                                           
f       24116.0  65.103873  2.926502  4.0  63.0  65.0  67.0  95.0
m       35827.0  70.443492  3.076521  1.0  68.0  70.0  72.0  95.0


The skewness values and the summary statistics show that `height` is moderately negatively skewed for females (about -0.70) and heavily negatively skewed for males (about -1.08). The minimum values (4 and 1 inches) and the maximum value (95 inches) are also unrealistic, which points to outliers or data-entry errors. Because the median is far less sensitive to such extreme values than the mean, it is the safer choice for imputing `height`.
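The group-wise median imputation described above can be sketched as follows. This is only an illustration on a made-up frame (`sample` and the `height_imputed` column are assumptions, not part of the original notebook); it fills each missing `height` with the median of that row's `gender` group.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
sample = pd.DataFrame({
    "gender": ["f", "f", "f", "m", "m", "m"],
    "height": [63.0, 65.0, np.nan, 70.0, np.nan, 72.0],
})

# transform("median") returns a Series aligned with the original rows,
# holding each row's group median (NaN values are skipped by default).
group_median = sample.groupby("gender")["height"].transform("median")

# Fill only the missing entries; existing heights are left unchanged.
sample["height_imputed"] = sample["height"].fillna(group_median)
print(sample)
```

Writing the result to a new column keeps the original values available for comparison; whether to overwrite `height` instead is a design choice.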

Imputing missing values of `income` is more complex, because income depends on many factors such as `age`, `ethnicity`, `education`, `job` and `location`. With so many plausible predictors, each combination of categories may contain too few data points for a group mean or median to be a meaningful estimate.

In general, it is better not to impute values at all unless the upcoming analysis is actually affected by the missing data.
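One thing worth checking for `income` in particular: the `df.head()` output shows several rows with an income of `-1`, while the null percentage for `income` is 0.00. The sketch below assumes that `-1` is a placeholder for "income not reported"; the source does not confirm this, so treat it as a hypothesis to verify. It uses a made-up `sample` frame and measures how common the placeholder is before converting it to `NaN`.

```python
import pandas as pd

# Made-up data for illustration only; values mirror the head() output.
sample = pd.DataFrame({"income": [-1, 80000, -1, 20000, -1]})

# Assumption: -1 stands for "income not reported" (not confirmed by the source).
is_placeholder = sample["income"] == -1
placeholder_share = is_placeholder.mean() * 100
print(f"{placeholder_share:.1f}% of rows use the -1 placeholder")

# mask() replaces values where the condition is True with NaN, so that
# isnull() and other missing-data tools count them as missing.
income_clean = sample["income"].mask(is_placeholder)
print(income_clean.isnull().sum())
```

If the share turns out to be large on the real data, that would be a further argument against imputing `income` from a group mean or median.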

### 2. Data Types

Accurate data types are critical for meaningful analysis and visualization. For example, numeric fields like `income` or `height` must be stored as numbers for statistical computations, while dates like `last_online` must be converted to datetime format for time-based calculations.
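As an illustration of the date conversion mentioned above, the sketch below parses strings in the `YYYY-MM-DD-HH-MM` layout seen in the `df.head()` output. The format string is inferred from those few rows, so it is an assumption about the whole column, and `sample` is a made-up frame.

```python
import pandas as pd

# Made-up frame using the layout seen in the head() output.
sample = pd.DataFrame({
    "last_online": ["2012-06-28-20-30", "2012-06-29-21-41", "2012-06-27-09-10"],
})

# Assumed layout: year-month-day-hour-minute, separated by hyphens.
sample["last_online"] = pd.to_datetime(
    sample["last_online"], format="%Y-%m-%d-%H-%M"
)
print(sample.dtypes)

# Once parsed, time-based calculations become possible, e.g. whole days
# between each user's last activity and the most recent activity overall.
days_since = (sample["last_online"].max() - sample["last_online"]).dt.days
print(days_since.tolist())
```

If some rows in the real column do not follow this layout, `pd.to_datetime` raises an error by default; passing `errors="coerce"` would turn those rows into `NaT` instead.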

#### Questions:

##### 1. Are there any inconsistencies in the data types across columns (e.g., numerical data stored as strings)?

To check the data types of the columns, we use the `dtypes` attribute.

In [7]:
df.dtypes

age              int64
status          object
gender          object
body_type       object
diet            object
drinks          object
education       object
ethnicity       object
height         float64
income           int64
job             object
last_online     object
location        object
pets            object
religion        object
sign            object
speaks          object
dtype: object

From the output above, most columns have an appropriate data type, with a few exceptions:
- `last_online`: It is stored as `object` (string) but should be a datetime.