# Analysing Bumble Profiles using Python

## Dataset

The dataset is available from the [bumble.csv](bumble.csv) file. The following columns are in the dataset:
1. `age`: Age of the user.
2. `status`: Relationship status (e.g., single, married, seeing someone).
3. `gender`: Gender of the user (e.g., m, f).
4. `body_type`: Descriptions of physical appearance (e.g., athletic, curvy, thin).
5. `diet`: Dietary preferences (e.g., vegetarian, vegan, anything).
6. `drinks`: Drinking habits (e.g., socially, often).
7. `education`: Education level (e.g., college, masters).
8. `ethnicity`: Ethnicity of the user (e.g., white, asian, black).
9. `height`: Height of the user (in inches).
10. `income`: User-reported annual income.
11. `job`: Job sector of the user (e.g., sales, marketing, student).
12. `last_online`: Date and time when the user was last active.
13. `location`: City and state where the user resides.
14. `pets`: Pet preference of the user (e.g., likes cats, has dogs).
15. `religion`: Religion of the user.
16. `sign`: Star sign of the user.
17. `speaks`: Languages the user speaks.

## Part 1: Data Cleaning

### 1. Inspecting Missing Data

Missing data is a common issue in real-world datasets. On a platform like Bumble, missing user information might reflect gaps in the user profile setup process, incomplete data collection, or users intentionally leaving certain fields blank.

Now we need to assess the extent of missing data, understand its potential impact, and decide the most appropriate methods to address it.

First, we need to load the dataset from the CSV file into a pandas DataFrame.

Before writing any Python code, install the `pandas` library by running the following command in your terminal (Bash on Linux, PowerShell on Windows). Make sure that Python is on your PATH.

In [None]:
pip install pandas

Once `pandas` is installed, run the following Python snippet to load the CSV file into a pandas DataFrame and preview its first five rows.

In [None]:
import pandas as pd

df = pd.read_csv("bumble.csv")
df.head()

As part of the analysis, certain questions need to be answered. For each question, we will use Python to find the answer.

#### Questions:

##### 1. Which columns in the dataset have missing values, and what percentage of data is missing in each column?



To find out which columns have missing values, we chain the pandas DataFrame methods `isnull()` and `any()`, as in the following Python snippet. `isnull()` flags every missing cell, and `any()` then reports, for each column, whether at least one cell was flagged. The result pairs each column name with `True` (the column contains missing values) or `False` (it does not).

In [6]:
print(df.isnull().any())

age            False
status         False
gender         False
body_type       True
diet            True
drinks          True
education       True
ethnicity       True
height          True
income         False
job             True
last_online    False
location       False
pets            True
religion        True
sign            True
speaks          True
dtype: bool
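
To list only the affected columns instead of scanning the full `True`/`False` output, the boolean result can be used to filter itself. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are hypothetical stand-ins for `bumble.csv`, not real data from the dataset):

```python
import pandas as pd

# Hypothetical toy data standing in for bumble.csv
toy = pd.DataFrame({
    "age": [22, 35, 41, 29],
    "diet": ["vegan", None, None, "anything"],
    "pets": ["likes cats", "has dogs", None, "likes dogs"],
})

# Boolean Series indexed by column name: True where a column has any null
has_missing = toy.isnull().any()

# Keep only the True entries and read off their column names
missing_columns = has_missing[has_missing].index.tolist()
print(missing_columns)  # ['diet', 'pets']
```

The same two lines should work unchanged on `df` once the real dataset is loaded.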


To find the percentage of null values in each column, we again use `isnull()` to flag each null value and then `sum()` to count the flagged values per column. Dividing that count by the number of rows, which is available as `df.shape[0]` (`shape` is an attribute, not a function), and multiplying by 100 gives the percentage, which is rounded to two decimal places. Run the following Python snippet to find the percentage of nulls in each column.

In [14]:
print(round(df.isnull().sum()/df.shape[0]*100,2))

age             0.00
status          0.00
gender          0.00
body_type       8.83
diet           40.69
drinks          4.98
education      11.06
ethnicity       9.48
height          0.01
income          0.00
job            13.68
last_online     0.00
location        0.00
pets           33.23
religion       33.74
sign           18.44
speaks          0.08
dtype: float64
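
As a cross-check on the arithmetic, the mean of a boolean column is the fraction of `True` values, so `isnull().mean()` gives the same percentages as dividing the sum by the row count. The sketch below compares both forms on a small made-up DataFrame (`toy` is hypothetical, not taken from `bumble.csv`) and sorts the result so the worst-affected columns come first:

```python
import pandas as pd

# Hypothetical toy data standing in for bumble.csv
toy = pd.DataFrame({
    "age": [22, 35, 41, 29],
    "diet": ["vegan", None, None, "anything"],
    "pets": ["likes cats", "has dogs", None, "likes dogs"],
})

# Form used above: count of nulls divided by number of rows
pct_sum = toy.isnull().sum() / toy.shape[0] * 100

# Equivalent form: mean of the boolean null mask
pct_mean = toy.isnull().mean() * 100

# Round and sort so the columns with the most missing data appear first
missing_pct = pct_mean.round(2).sort_values(ascending=False)
print(missing_pct)
```

Here `diet` is 50% missing (2 of 4 rows), `pets` is 25% missing (1 of 4 rows), and `age` is 0% missing.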


##### 2. Are there columns where more than 50% of the data is missing? Would you drop the columns where more than 50% of the values are missing? If yes, why?

To check whether any column has more than 50% null values, run the following Python snippet. For each column, it returns:
- `True` if more than 50% of the values are null
- `False` if 50% or fewer of the values are null

In [15]:
print(df.isnull().sum()/df.shape[0]*100 > 50)

age            False
status         False
gender         False
body_type      False
diet           False
drinks         False
education      False
ethnicity      False
height         False
income         False
job            False
last_online    False
location       False
pets           False
religion       False
sign           False
speaks         False
dtype: bool


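None of the columns has more than 50% null values, so no column needs to be dropped on this basis. The column with the most missing data is `diet`, at 40.69%. Had a column crossed the threshold, dropping it would usually be reasonable, because a column that is mostly empty carries little information and is hard to impute reliably.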
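
For reference, here is one way such columns could be dropped if any did exceed the threshold. This is a sketch on a made-up DataFrame (`toy`, `THRESHOLD`, `cols_to_drop`, and `cleaned` are hypothetical names, and the values are not from `bumble.csv`), in which `diet` is deliberately 75% missing:

```python
import pandas as pd

# Hypothetical toy data; "diet" is 75% missing, "pets" is 25% missing
toy = pd.DataFrame({
    "age": [22, 35, 41, 29],
    "diet": ["vegan", None, None, None],
    "pets": ["likes cats", "has dogs", None, "likes dogs"],
})

THRESHOLD = 50  # percent of missing values above which a column is dropped

# Percentage of nulls per column
missing_pct = toy.isnull().mean() * 100

# Names of the columns that exceed the threshold
cols_to_drop = missing_pct[missing_pct > THRESHOLD].index.tolist()

# drop() returns a new DataFrame and leaves the original untouched
cleaned = toy.drop(columns=cols_to_drop)
print(cols_to_drop)              # ['diet']
print(cleaned.columns.tolist())  # ['age', 'pets']
```

On the real dataset, `cols_to_drop` would be an empty list, and `drop()` would return the DataFrame with all 17 columns intact.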