# Data Cleaning, Preprocessing and Exploratory Data Analysis

## Preliminary preparation

### 1. Importing the libraries

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nbformat
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

### 2. Importing the data

In [30]:
df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Serial-No,Period,Period.Year,Period.Quarter,Period.Month,Period.Day,2. Has it become more difficult or easier to find a job in your city?,"3. Is this a good time for people to make a large purchase such as furniture or electrical appliances, given the economic climate?","4. Compared to the last 6 months, are you able to spend (more, the same or less) money on large purchases (furniture, electrical appliances) over the next 6 months?",5. Will you be able to meet your regular expenses over the next 6 months?,...,"23. In the past few months, what have you started doing, or done more than before, to deal with the rising prices?",24. Over the past few months have you had to borrow money to meet your day-to-day expenses due to the rising prices?,25. What do you believe is the main reason that is causing the increase in food and fuel prices?,26. Thinking about the amount of stress in your life as a result of rising prices and financial issues How would you describe most of your days recently?,"27. In your opinion, what do you think is the best way to combat the increase in prices in Africa",28. Which of the following things are you currently doing to help you curb/contain inflation?,"29. Compared to a year ago, How have your eating habits changed because of the rising food prices?","30. Compared to a year ago, How have your eating habits for meat and milk changed because of the rising food prices?",31. Which of the following types of alcoholic beverages have you drunk in the past 3 months?,32. Which ONE Alcohol category do you currently drink more often than any of the other types of alcohol?
0,R3229,2022-06-20,2022,2,6,20,Difficult,Yes,Same,No,...,Delayed non-essential purchases,"No, have not borrowed",Impact of COVID-19 pandemic,Quite stressful,Governments should support the local productio...,"I stopped specific diets e.g Keto, Vegan, Glut...",,,,
1,R3377,2022-06-20,2022,2,6,20,Difficult,Yes,More,Maybe,...,"Other changes in purchasing habits (e.g., buyi...","Yes, had to borrow a few times",Impact of Russia's invasion of Ukraine,Quite stressful,People should save more,I am going to bars less often;I am purchasing ...,,,,
2,R2149,2022-11-20,2022,4,11,20,Difficult,No,Less,Yes,...,Searched for sales and promotions;Purchased ch...,"Yes, had to borrow a few times",Impact of Russia's invasion of Ukraine,Quite stressful,People should spend less;People should save more,I am purchasing less produce;I am purchasing l...,,,,
3,R3410,2022-06-20,2022,2,6,20,Difficult,Maybe,More,Maybe,...,Searched for sales and promotions;Purchased ch...,"Yes, had to borrow often",Impact of Russia's invasion of Ukraine,Quite stressful,Governments should support the local productio...,I am eating out at restaurants less often;I am...,,,,
4,R2099,2022-11-20,2022,4,11,20,Difficult,No,Less,Maybe,...,"Other changes in purchasing habits (e.g., buyi...","Yes, had to borrow a few times",Impact of COVID-19 pandemic,Quite stressful,Governments should support the local productio...,I am stocking up on cheap street/local market ...,,,,


### 3. Understanding the data structure

In [31]:
df.shape

(3934, 55)

The data has 3,934 rows and 55 columns.

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3934 entries, 0 to 3933
Data columns (total 55 columns):
 #   Column                                                                                                                                                                                                 Non-Null Count  Dtype 
---  ------                                                                                                                                                                                                 --------------  ----- 
 0   Serial-No                                                                                                                                                                                              3934 non-null   object
 1   Period                                                                                                                                                                                                 3934 non

In [33]:
df.describe()

Unnamed: 0,Period.Year,Period.Quarter,Period.Month,Period.Day
count,3934.0,3934.0,3934.0,3934.0
mean,2023.123284,2.763091,7.528724,20.0
std,0.776139,1.198482,3.317439,0.0
min,2022.0,1.0,3.0,20.0
25%,2023.0,2.0,6.0,20.0
50%,2023.0,3.0,7.0,20.0
75%,2024.0,4.0,11.0,20.0
max,2024.0,4.0,12.0,20.0


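The only numerical columns are the four date components (`Period.Year`, `Period.Quarter`, `Period.Month` and `Period.Day`). `Period.Day` is always 20, so its standard deviation is 0.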

### 4. Data preparation

In [34]:
df.columns

Index(['Serial-No', 'Period', 'Period.Year', 'Period.Quarter', 'Period.Month',
       'Period.Day',
       '2. Has it become more difficult or easier to find a job in your city?',
       '3. Is this a good time for people to make a large purchase such as furniture or electrical appliances, given the economic climate?',
       '4. Compared to the last 6 months, are you able to spend (more, the same or less) money on large purchases (furniture, electrical appliances) over the next 6 months?',
       '5. Will you be able to meet your regular expenses over the next 6 months?',
       '6. How do you expect your household's income to change over the next 6 months?',
       '7. How do you expect general economic conditions in your city to change over the next 6 months?',
       '8. How do you expect general economic conditions in your country to change over the next 6 months?',
       '8a.Compared to the last 6 months, has it become more difficult or easier to make money in your city?',
     

Since each column name is a full survey question, we shorten the names to make the analysis easier.

In [35]:
column_mapping = {
'Serial-No': 'Serial_No',
 'Period': 'Period',
 'Period.Year': 'Period.Year',
 'Period.Quarter': 'Period.Quarter',
 'Period.Month': 'Period.Month',
 'Period.Day': 'Period.Day',
 '2. Has it become more difficult or easier to find a job in your city?': 'Job_Market_Difficulty',
 '3. Is this a good time for people to make a large purchase such as furniture or electrical appliances, given the economic climate?': 'Good_Time_For_Large_Purchase',
 '4. Compared to the last 6 months, are you able to spend (more, the same or less) money on large purchases (furniture, electrical appliances) over the next 6 months?': 'Future_Spend_On_Large_Purchases',
 '5. Will you be able to meet your regular expenses over the next 6 months?': 'Able_To_Meet_Regular_Expenses',
 "6. How do you expect your household's income to change over the next 6 months?": 'Expected_Income_Change',
 '7. How do you expect general economic conditions in your city to change over the next 6 months?': 'Expected_Economic_Change_City',
 '8. How do you expect general economic conditions in your country to change over the next 6 months?' : 'Expected_Economic_Change_Country',
 '8a.Compared to the last 6 months, has it become more difficult or easier to make money in your city?': 'Earning_Difficulty',
 '8b. Do you think your country is heading in the right direction or the wrong direction?': 'Country_Direction_Thoughts',
 '8c. How many meals do you take a day?': 'Meals_Per_Day',
 '9. Gender': 'Gender',
 '10. Marital status': 'Marital_Status',
 '11. Age': 'Age',
 '11b. Age-group': 'Age_Group',
 '12. What\'s your highest level of education?': 'Highest_Education_Level',
 '13. Occupation': 'Occupation',
 '14. Which population group do you identify with?': 'Population_Group',
 '15. Do you have any children under the age of 18 years living in your home?' : 'Children_Under_18',
 '16. What is the size of your household (number of people in the family)?': 'Household_Size',
 "18. What would you say is the COMBINED TOTAL amount of money that everyone in the household earns on an average month after tax.? This includes everyone's salaries and income from their businesses?": 'Total_Household_Income',
 'Income Level': 'Income_Level',
 '19. Now, thinking about the things that are constantly worrying you and your community lately What would you say are the top three issues that you are most concerned about?': 'Top_3_Concerns',
 '20. Over the past few months when shopping for Food & Beverages (groceries, take-out, etc), what have you noticed in terms of their prices?': 'Food_Beverages_Prices_Change',
 '20. Over the past few months when shopping for Utilities (Water, Electricity, Gas, etc), what have you noticed in terms of their prices?': 'Utilities_Prices_Change',
 '20. Over the past few months when shopping for Household furniture & appliances (Beds, Fridge etc), what have you noticed in terms of their prices?': 'Furniture_Appliances_Prices_Change',
 '20. Over the past few months when shopping for Alcoholic Beverages, what have you noticed in terms of their prices?': 'Alcoholics_Prices_Change',
 '20. Over the past few months when paying for Housing/Rent, what have you noticed in terms of their prices?': 'Housing_Rent_Prices_Change',
 '20. Over the past few months when paying for Transportation (Bus fare, Fuel etc), what have you noticed in terms of their prices?': 'Transportation_Prices_Change',
 '20. Over the past few months when shopping for Clothing & Shoes, what have you noticed in terms of their prices?': 'Clothing_Shoes_Prices_Change',
 '20. Over the past few months when Paying for Education (School fees, school supplies), what have you noticed in terms of their prices?': 'Education_Prices_Change',
 '20. Over the past few months when Paying for Recreation (social activities, going out, movies, travelling, sports), what have you noticed in terms of their prices?': 'Recreation_Prices_Change',
 '20. Over the past few months when shopping for Communication (Airtime, Mobile data and internet costs), what have you noticed in terms of their prices?': 'Communication_Prices_Change',
 '21. How concerned are you about the rising prices?': 'Concern_About_Rising_Prices',
 '22. Rating 1-4 Where 1 is no impact and 4 is A lot of impact. How are the rising prices negatively impacting your ability to meet my day to day expenses (food, transport, airtime)': 'Rising_Prices_Impact_Day_To_Day_Expenses',
 '22. Rating 1-4 Where 1 is no impact and 4 is A lot of impact. How are the rising prices negatively impacting your ability to make discretionary spending (furniture, car, travel)': 'Rising_Prices_Impact_Discretionary_Spending',
 '22. Rating 1-4 Where 1 is no impact and 4 is A lot of impact. How are the rising prices negatively impacting your ability to find an affordable housing': 'Rising_Prices_Impact_Affordable_Housing',
 '22. Rating 1-4 Where 1 is no impact and 4 is A lot of impact. How are the rising prices negatively impacting your ability to invest, save for my retirement': 'Rising_Prices_Impact_Invest_Retirement',
 '22. Rating 1-4 Where 1 is no impact and 4 is A lot of impact. How are the rising prices negatively impacting your ability to get healthcare (doctors, exams, medications, etc)': 'Rising_Prices_Impact_Healthcare',
 '22. Rating 1-4 Where 1 is no impact and 4 is A lot of impact. How are the rising prices negatively impacting your ability to get organic food to eat and/or healthier meals': 'Rising_Prices_Impact_Organic_Food',
 '22. Rating 1-4 Where 1 is no impact and 4 is A lot of impact. How are the rising prices negatively impacting your ability to get education for yourself or your children': 'Rising_Prices_Impact_Education',
 '23. In the past few months, what have you started doing, or done more than before, to deal with the rising prices?': 'Dealing_With_Rising_Prices',
 '24. Over the past few months have you had to borrow money to meet your day-to-day expenses due to the rising prices?': 'Borrowed_Money_For_Rising_Prices',
 '25. What do you believe is the main reason that is causing the increase in food and fuel prices?': 'Reason_For_Price_Increase',
 '26. Thinking about the amount of stress in your life as a result of rising prices and financial issues How would you describe most of your days recently?': 'Stress_Level',
 '27. In your opinion, what do you think is the best way to combat the increase in prices in Africa': 'Combat_Price_Increase',
 '28. Which of the following things are you currently doing to help you curb/contain inflation?': 'Curb_Inflation_Methods',
 '29. Compared to a year ago, How have your eating habits changed because of the rising food prices?': 'Eating_Habits_Change',
 '30. Compared to a year ago, How have your eating habits for meat and milk changed because of the rising food prices?': 'Meat_Milk_Habits_Change',
 '31. Which of the following types of alcoholic beverages have you drunk in the past 3 months?': 'Alcohol_Types_Drunk',
 '32. Which ONE Alcohol category do you currently drink more often than any of the other types of alcohol?': 'Alcohol_Category_Most_Drunk',
}
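
`DataFrame.rename` ignores mapping keys that do not match any column by default, so a typo in one of these long question strings would leave that column unrenamed without raising an error. A minimal sketch of a check that could be run before renaming is shown here; `check_mapping` and the toy inputs are hypothetical helpers for illustration, not part of the original notebook.

```python
def check_mapping(columns, mapping):
    """Return (unmatched_keys, unmapped_columns, duplicate_targets) for a rename mapping."""
    columns = list(columns)
    # Keys that match no column: rename would silently skip these
    unmatched_keys = [key for key in mapping if key not in columns]
    # Columns that would keep their original (long) name
    unmapped_columns = [col for col in columns if col not in mapping]
    # Names the columns would carry after renaming
    new_names = [mapping.get(col, col) for col in columns]
    duplicate_targets = sorted({name for name in new_names if new_names.count(name) > 1})
    return unmatched_keys, unmapped_columns, duplicate_targets


# Toy example: three real column names plus one deliberately misspelled key
columns = ['Serial-No', '9. Gender', '11. Age']
mapping = {
    'Serial-No': 'Serial_No',
    '9. Gender': 'Gender',
    '11. Age ': 'Age',  # trailing space, so this key matches nothing
}

unmatched, unmapped, duplicates = check_mapping(columns, mapping)
print('Keys with no matching column:', unmatched)
print('Columns left unrenamed:', unmapped)
print('Duplicate target names:', duplicates)
```

With the real data, the same function could be called as `check_mapping(df.columns, column_mapping)`; three empty lists would indicate that every column is renamed to a unique short name.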

In [36]:
df = df.rename(columns=column_mapping)
df.columns.tolist()

['Serial_No',
 'Period',
 'Period.Year',
 'Period.Quarter',
 'Period.Month',
 'Period.Day',
 'Job_Market_Difficulty',
 'Good_Time_For_Large_Purchase',
 'Future_Spend_On_Large_Purchases',
 'Able_To_Meet_Regular_Expenses',
 'Expected_Income_Change',
 'Expected_Economic_Change_City',
 'Expected_Economic_Change_Country',
 'Earning_Difficulty',
 'Country_Direction_Thoughts',
 'Meals_Per_Day',
 'Gender',
 'Marital_Status',
 'Age',
 'Age_Group',
 'Highest_Education_Level',
 'Occupation',
 'Population_Group',
 'Children_Under_18',
 'Household_Size',
 'Total_Household_Income',
 'Income_Level',
 'Top_3_Concerns',
 'Food_Beverages_Prices_Change',
 'Utilities_Prices_Change',
 'Furniture_Appliances_Prices_Change',
 'Alcoholics_Prices_Change',
 'Housing_Rent_Prices_Change',
 'Transportation_Prices_Change',
 'Clothing_Shoes_Prices_Change',
 'Education_Prices_Change',
 'Recreation_Prices_Change',
 'Communication_Prices_Change',
 'Concern_About_Rising_Prices',
 'Rising_Prices_Impact_Day_To_Day_Expenses

### 5. Handling duplicates

In [37]:
print(df.duplicated().sum())

0


The dataset has no fully duplicated rows.
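
`df.duplicated()` only flags rows that are identical in every column. If `Serial_No` is meant to identify each respondent uniquely (an assumption, the source does not say so), a stricter check restricted to that column may be worth running as well. The toy frame below is hypothetical: the IDs are taken from the preview above, but the repeated `R3229` row is invented purely to show the difference between the two checks.

```python
import pandas as pd

# Hypothetical toy frame: the repeated 'R3229' ID is invented for illustration
toy = pd.DataFrame({
    'Serial_No': ['R3229', 'R3377', 'R3229'],
    'Period': ['2022-06-20', '2022-06-20', '2022-11-20'],
})

# Rows identical across every column
full_row_duplicates = int(toy.duplicated().sum())

# Rows whose ID has already appeared, regardless of the other columns
id_duplicates = int(toy.duplicated(subset=['Serial_No']).sum())

print('Full-row duplicates:', full_row_duplicates)
print('Repeated Serial_No values:', id_duplicates)
```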

### 6. Handling missing values

In [38]:
print(df.columns[df.isnull().any()])

Index(['Country_Direction_Thoughts', 'Meals_Per_Day', 'Household_Size',
       'Combat_Price_Increase', 'Eating_Habits_Change',
       'Meat_Milk_Habits_Change', 'Alcohol_Types_Drunk',
       'Alcohol_Category_Most_Drunk'],
      dtype='object')


Eight columns contain missing values. The next step is to determine how much data is missing in each column and, from that, decide how to handle it.

In [39]:
missing_val_per_column = df.isnull().sum()
percentage_missing = ((missing_val_per_column / len(df)) * 100).round(2)
missing_data = pd.DataFrame({'Missing Values': missing_val_per_column[missing_val_per_column > 0], 'Percentage Missing': percentage_missing[missing_val_per_column > 0]}).sort_values(by='Percentage Missing', ascending=True)

missing_data

Unnamed: 0,Missing Values,Percentage Missing
Combat_Price_Increase,1,0.03
Household_Size,470,11.95
Meat_Milk_Habits_Change,972,24.71
Eating_Habits_Change,972,24.71
Alcohol_Types_Drunk,2187,55.59
Alcohol_Category_Most_Drunk,2194,55.77
Country_Direction_Thoughts,2477,62.96
Meals_Per_Day,3444,87.54


Because the amount of missing data varies widely, we handle each case differently.

1. **Country_Direction_Thoughts and Meals_Per_Day** - both are missing more than 60% of their values and are not used later on, so we drop them.
2. **Household_Size** - it has a relatively low missing count (11.95%), so we fill it with the placeholder category "Not Provided". The column is categorical but not important enough to justify model-based imputation.
3. **Combat_Price_Increase** - only one row is missing, so it is filled with the mode.
4. **Remaining** - these have a moderate to high missing count, so we introduce a new category "Not Provided". _Eating_Habits_Change_, _Meat_Milk_Habits_Change_ and _Alcohol_Types_Drunk_ hold multiple responses per row, so modal imputation would not represent them well. _Alcohol_Category_Most_Drunk_ is not inferred from _Alcohol_Types_Drunk_, so it is also filled with the new category.
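
The decision rules above can be sketched as a small function of the missing share. The counts are copied from the table above; the function name `choose_strategy` and the exact 60% cut-off parameter are illustrative assumptions that simply restate the list, not a general recommendation.

```python
TOTAL_ROWS = 3934

# Missing-value counts as reported in the table above
missing_counts = {
    'Combat_Price_Increase': 1,
    'Household_Size': 470,
    'Meat_Milk_Habits_Change': 972,
    'Eating_Habits_Change': 972,
    'Alcohol_Types_Drunk': 2187,
    'Alcohol_Category_Most_Drunk': 2194,
    'Country_Direction_Thoughts': 2477,
    'Meals_Per_Day': 3444,
}


def choose_strategy(n_missing, total=TOTAL_ROWS, drop_threshold=60.0):
    """Pick a handling strategy from the number of missing rows (illustrative rules)."""
    share = n_missing / total * 100
    if share > drop_threshold:
        return 'drop column'
    if n_missing == 1:
        return 'fill with mode'
    return 'fill with placeholder category'


plan = {column: choose_strategy(n) for column, n in missing_counts.items()}
for column, strategy in plan.items():
    print(f'{column}: {strategy}')
```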

In [40]:
# Drop the columns with more than 60% missing values
df = df.drop(columns=['Country_Direction_Thoughts', 'Meals_Per_Day'])

# Fill the single missing value in Combat_Price_Increase with the mode
df['Combat_Price_Increase'] = df['Combat_Price_Increase'].fillna(df['Combat_Price_Increase'].mode()[0])

# Fill every remaining missing value (Household_Size, the eating-habit
# columns and the alcohol columns) with the placeholder 'Not Provided'
df = df.fillna('Not Provided')

### 7. Handling inconsistencies

In [41]:
df.dtypes

Serial_No                                      object
Period                                         object
Period.Year                                     int64
Period.Quarter                                  int64
Period.Month                                    int64
Period.Day                                      int64
Job_Market_Difficulty                          object
Good_Time_For_Large_Purchase                   object
Future_Spend_On_Large_Purchases                object
Able_To_Meet_Regular_Expenses                  object
Expected_Income_Change                         object
Expected_Economic_Change_City                  object
Expected_Economic_Change_Country               object
Earning_Difficulty                             object
Gender                                         object
Marital_Status                                 object
Age                                            object
Age_Group                                      object
Highest_Education_Level     

- The following column requires attention:
    - `Period`: object -> datetime

In [42]:
df['Period'] = pd.to_datetime(df['Period'], format='%Y-%m-%d')
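
Once `Period` is a datetime column, the pre-computed `Period.Year`, `Period.Quarter`, `Period.Month` and `Period.Day` columns can be cross-checked against it through the `.dt` accessor. The sketch below uses two rows taken from the preview at the top of the notebook; the `toy` frame and the `mismatches` dictionary are illustrative names, and whether the full dataset passes the same check is not shown in the source.

```python
import pandas as pd

# Two rows taken from the preview at the top of the notebook
toy = pd.DataFrame({
    'Period': ['2022-06-20', '2022-11-20'],
    'Period.Year': [2022, 2022],
    'Period.Quarter': [2, 4],
    'Period.Month': [6, 11],
    'Period.Day': [20, 20],
})
toy['Period'] = pd.to_datetime(toy['Period'], format='%Y-%m-%d')

# Components derived from the timestamp itself
derived_components = {
    'Period.Year': toy['Period'].dt.year,
    'Period.Quarter': toy['Period'].dt.quarter,
    'Period.Month': toy['Period'].dt.month,
    'Period.Day': toy['Period'].dt.day,
}

# Count the rows where a stored component disagrees with the derived one
mismatches = {
    name: int((toy[name] != derived).sum())
    for name, derived in derived_components.items()
}
print(mismatches)
```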

## Exploratory Data Analysis (EDA)

### 1. Numerical Columns

In [43]:
df.describe()

Unnamed: 0,Period,Period.Year,Period.Quarter,Period.Month,Period.Day
count,3934,3934.0,3934.0,3934.0,3934.0
mean,2023-09-19 20:55:30.960854016,2023.123284,2.763091,7.528724,20.0
min,2022-06-20 00:00:00,2022.0,1.0,3.0,20.0
25%,2023-03-20 00:00:00,2023.0,2.0,6.0,20.0
50%,2023-07-20 00:00:00,2023.0,3.0,7.0,20.0
75%,2024-03-20 00:00:00,2024.0,4.0,11.0,20.0
max,2024-12-20 00:00:00,2024.0,4.0,12.0,20.0
std,,0.776139,1.198482,3.317439,0.0


In [44]:
df['Period.Year'].value_counts()

Period.Year
2023    1505
2024    1457
2022     972
Name: count, dtype: int64

In [45]:
df['Period.Quarter'].value_counts()    

Period.Quarter
4    1492
3     995
1     977
2     470
Name: count, dtype: int64

In [46]:
df['Period.Month'].value_counts() 

Period.Month
11    1002
7      995
3      977
12     490
6      470
Name: count, dtype: int64

#### Insights on Time

From the tables above, the following insights can be derived.

- The study covers 2022 to 2024, from June 2022 through December 2024.
- Every observation is dated the 20th of its month.
- Most observations fall in 2023 (1,505) and the fewest in 2022 (972), possibly because of COVID-related issues; the 2022 data also starts only in June.
- Observations were made in only five months: Mar (Q1), Jun (Q2), Jul (Q3), and Nov and Dec (Q4).
- The last quarter (Nov, Dec) has the most observations (1,492).
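
The month and quarter tables can be reconciled with a short calculation: mapping each month to its calendar quarter and summing the month counts should reproduce the quarter counts. The counts below are copied from the `value_counts()` output above; only the variable names are new.

```python
# Month counts as reported by value_counts() above
month_counts = {11: 1002, 7: 995, 3: 977, 12: 490, 6: 470}

quarter_counts = {}
for month, count in month_counts.items():
    # Calendar quarters: Jan-Mar = 1, Apr-Jun = 2, Jul-Sep = 3, Oct-Dec = 4
    quarter = (month - 1) // 3 + 1
    quarter_counts[quarter] = quarter_counts.get(quarter, 0) + count

print(dict(sorted(quarter_counts.items())))
print('Total observations:', sum(quarter_counts.values()))
```

The fourth quarter is the only one fed by two survey months, which is why it holds the most observations.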

### 2. Categorical Columns

As most of the remaining columns are categorical, we generate frequency distributions only for those that will be used in the later hypotheses and research questions.

1. Age_Group

2. Gender

3. Highest_Education_Level

4. Population_Group

5. Occupation

6. Job_Market_Difficulty

7. Income_Level

8. Total_Household_Income

9. Children_Under_18

10. Household_Size

#### Insights on Demographic Variables

In [47]:
fig = make_subplots(rows=2, cols=2, subplot_titles=('Age Group Distribution', 'Gender Distribution', 'Education Level Distribution', 'Population Group Distribution'))

fig.add_trace(go.Bar(x=df['Age_Group'].value_counts().index, y=df['Age_Group'].value_counts().values), row=1, col=1)

fig.add_trace(go.Bar(x=df['Gender'].value_counts().index, y=df['Gender'].value_counts().values), row=1, col=2)

fig.add_trace(go.Bar(x=df['Highest_Education_Level'].value_counts().index, y=df['Highest_Education_Level'].value_counts().values), row=2, col=1)

fig.add_trace(go.Bar(x=df['Population_Group'].value_counts().index, y=df['Population_Group'].value_counts().values), row=2, col=2) 

fig.update_layout(height=800, width=800, title_text="Demographic Distribution", showlegend=False)

fig.update_xaxes(title_text="Age Group", row=1, col=1)
fig.update_xaxes(title_text='Gender', row=1, col=2)
fig.update_xaxes(title_text='Education Level', row=2, col=1)
fig.update_xaxes(title_text='Population Group', row=2, col=2)

fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_yaxes(title_text='Count', row=1, col=2)
fig.update_yaxes(title_text='Count', row=2, col=1)
fig.update_yaxes(title_text='Count', row=2, col=2)

fig.show()

print("Age_Group value counts:\n", df['Age_Group'].value_counts(), "\n")
print("Gender value counts:\n", df['Gender'].value_counts(), "\n")
print("Highest_Education_Level value counts:\n", df['Highest_Education_Level'].value_counts(), "\n")
print("Population_Group value counts:\n", df['Population_Group'].value_counts(), "\n")

Age_Group value counts:
 Age_Group
Millennials     1984
Gen Z           1093
Gen X            640
Baby Boomers     217
Name: count, dtype: int64 

Gender value counts:
 Gender
Male      1974
Female    1960
Name: count, dtype: int64 

Highest_Education_Level value counts:
 Highest_Education_Level
University degree             1346
Vocational training           1276
Diploma                        762
High school                    374
Studying                        86
Professional Qualification      58
Up to Primary School            32
Name: count, dtype: int64 

Population_Group value counts:
 Population_Group
Black              3900
Mixed race           15
White                10
Indian or Asian       9
Name: count, dtype: int64 




From these demographic variables we can conclude that:
- Respondents are predominantly Millennials and Gen Z, who together make up about 78% of the sample.
- The gender split is almost even (1,974 males and 1,960 females).
- The education level is fairly high: about two thirds of respondents hold a university degree or vocational training, and very few stopped at primary school.
- Almost all respondents (3,900 of 3,934) identify as Black.
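
Converting the raw counts to percentages makes these conclusions easier to verify. A minimal sketch, using counts copied from the `value_counts()` output above; the helper `shares` is a hypothetical name introduced for illustration.

```python
def shares(counts):
    """Convert a {category: count} mapping to percentages rounded to 2 decimals."""
    total = sum(counts.values())
    return {category: round(count / total * 100, 2) for category, count in counts.items()}


# Counts copied from the value_counts() output above
gender = {'Male': 1974, 'Female': 1960}
age_group = {'Millennials': 1984, 'Gen Z': 1093, 'Gen X': 640, 'Baby Boomers': 217}

gender_pct = shares(gender)
age_pct = shares(age_group)

print(gender_pct)
print(age_pct)
```

On the DataFrame itself, `df['Gender'].value_counts(normalize=True)` would give the same proportions directly.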

In [48]:
crosstab = pd.crosstab(df['Occupation'], df['Job_Market_Difficulty']).sort_values(by='Difficult', ascending=False)
crosstab = crosstab[['Difficult', 'Same', 'Easier']]
crosstab

Occupation,Difficult,Same,Easier
Salaried employee,1501,363,38
Self employed/Contractor,519,147,34
Business owner,507,144,38
Unemployed,262,90,20
Student,211,33,6
Stay-at-home mom or dad,19,2,0


In [49]:
fig2 = make_subplots(rows=1, cols=3, subplot_titles=('Occupation Distribution', 'Job Market Difficulty Distribution',  'Occupation and Job Market Difficulty Comparison'))

fig2.add_trace(
    go.Bar(
        x=df['Occupation'].value_counts().index,
        y=df['Occupation'].value_counts().values
    ),
    row=1, col=1
)

fig2.add_trace(
    go.Bar(
        x=df['Job_Market_Difficulty'].value_counts().index,
        y=df['Job_Market_Difficulty'].value_counts().values
    ),
    row=1, col=2
)

# Use one fixed colour per job-market response across all occupations
difficulty_colors = {'Difficult': 'darkred', 'Same': 'goldenrod', 'Easier': 'forestgreen'}

for difficulty in crosstab.columns:
    fig2.add_trace(
        go.Bar(
            x=crosstab.index,
            y=crosstab[difficulty],
            name=difficulty,
            marker_color=difficulty_colors.get(difficulty, 'forestgreen')
        ),
        row=1, col=3
    )

fig2.update_layout(
    height=600, width=1400, title_text="Occupation and Job Market Difficulty Distribution", showlegend=False
)
fig2.update_xaxes(title_text="Occupation", row=1, col=1)
fig2.update_xaxes(title_text="Job Market Difficulty", row=1, col=2)
fig2.update_yaxes(title_text="Count", row=1, col=1)
fig2.update_yaxes(title_text="Count", row=1, col=2)

fig2.show()

print("Occupation value counts:\n", df['Occupation'].value_counts(), "\n")
print("Job_Market_Difficulty value counts:\n", df['Job_Market_Difficulty'].value_counts(), "\n")

Occupation value counts:
 Occupation
Salaried employee           1902
Self employed/Contractor     700
Business owner               689
Unemployed                   372
Student                      250
Stay-at-home mom or dad       21
Name: count, dtype: int64 

Job_Market_Difficulty value counts:
 Job_Market_Difficulty
Difficult    3019
Same          779
Easier        136
Name: count, dtype: int64 



#### Insights on Occupational Variables

In [50]:
fig3 = make_subplots(rows=2, cols=2, subplot_titles=(
    'Income Level Distribution',
    'Total Household Income Distribution',
    'Children Under 18 Distribution',
    'Household Size Distribution'
))

fig3.add_trace(
    go.Bar(
        x=df['Income_Level'].value_counts().index,
        y=df['Income_Level'].value_counts().values
    ),
    row=1, col=1
)

fig3.add_trace(
    go.Bar(
        x=df['Total_Household_Income'].value_counts().index,
        y=df['Total_Household_Income'].value_counts().values
    ),
    row=1, col=2
)

fig3.add_trace(
    go.Bar(
        x=df['Children_Under_18'].value_counts().index,
        y=df['Children_Under_18'].value_counts().values
    ),
    row=2, col=1
)

fig3.add_trace(
    go.Bar(
        x=df['Household_Size'].value_counts().index,
        y=df['Household_Size'].value_counts().values
    ),
    row=2, col=2
)

fig3.update_layout(
    height=800, width=900, title_text="Income and Household Characteristics Distribution", showlegend=False
)

fig3.update_xaxes(title_text="Income Level", row=1, col=1)
fig3.update_xaxes(title_text="Total Household Income", row=1, col=2)
fig3.update_xaxes(title_text="Children Under 18", row=2, col=1)
fig3.update_xaxes(title_text="Household Size", row=2, col=2)

fig3.update_yaxes(title_text="Count", row=1, col=1)
fig3.update_yaxes(title_text="Count", row=1, col=2)
fig3.update_yaxes(title_text="Count", row=2, col=1)
fig3.update_yaxes(title_text="Count", row=2, col=2)

fig3.show()

print("Income_Level value counts:\n", df['Income_Level'].value_counts(), "\n")
print("Total_Household_Income value counts:\n", df['Total_Household_Income'].value_counts(), "\n")
print("Children_Under_18 value counts:\n", df['Children_Under_18'].value_counts(), "\n")
print("Household_Size value counts:\n", df['Household_Size'].value_counts(), "\n")

Income_Level value counts:
 Income_Level
Middle Income    2058
Low Income       1515
High Income       361
Name: count, dtype: int64 

Total_Household_Income value counts:
 Total_Household_Income
Under USD500                     1515
Between USD501 and USD900        1200
Between USD901 and USD1,800       858
Between USD1,801 and USD4,500     296
More than USD4,500                 65
Name: count, dtype: int64 

Children_Under_18 value counts:
 Children_Under_18
Yes    2687
No     1247
Name: count, dtype: int64 

Household_Size value counts:
 Household_Size
3 or 4 people         1664
Up to 2 people         997
5 or 6 people          711
Not Provided           470
More than 6 people      92
Name: count, dtype: int64 



In [51]:
occupation_percentage = (df['Occupation'].value_counts(normalize=True) * 100).round(2)
occupation_percentage_df = occupation_percentage.reset_index()
occupation_percentage_df.columns = ['Occupation', 'Percentage']

occupation_percentage_df

Unnamed: 0,Occupation,Percentage
0,Salaried employee,48.35
1,Self employed/Contractor,17.79
2,Business owner,17.51
3,Unemployed,9.46
4,Student,6.35
5,Stay-at-home mom or dad,0.53


In [52]:
# Normalize the crosstab by rows to get percentages
crosstab_percentage = (crosstab.div(crosstab.sum(axis=1), axis=0) * 100).round(2)

crosstab_percentage

Occupation,Difficult,Same,Easier
Salaried employee,78.92,19.09,2.0
Self employed/Contractor,74.14,21.0,4.86
Business owner,73.58,20.9,5.52
Unemployed,70.43,24.19,5.38
Student,84.4,13.2,2.4
Stay-at-home mom or dad,90.48,9.52,0.0


From the tables above we can reach the following conclusions:
- Salaried employees are the largest group, making up 48.35% of respondents.
- Most respondents (3,019 of 3,934) find that the job market is getting more difficult.
- In every occupation, at least 70% of respondents say the job market has become more difficult. Stay-at-home parents show the highest share (90.48%), although that group contains only 21 respondents.
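
The row percentages behind these conclusions can also be produced in one step with the `normalize='index'` option of `pd.crosstab`, and pairing them with group sizes helps flag shares that rest on very few respondents. The sketch below rebuilds two occupations from the counts in the crosstab above; the `toy` frame is a reconstruction for illustration, not the original data.

```python
import pandas as pd

# Counts copied from the crosstab above (two occupations shown for brevity)
counts = {
    ('Salaried employee', 'Difficult'): 1501,
    ('Salaried employee', 'Same'): 363,
    ('Salaried employee', 'Easier'): 38,
    ('Stay-at-home mom or dad', 'Difficult'): 19,
    ('Stay-at-home mom or dad', 'Same'): 2,
}

# Expand the counts back into one row per respondent
rows = [pair for pair, n in counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=['Occupation', 'Job_Market_Difficulty'])

# Row-normalised crosstab: each occupation's responses sum to 100%
row_pct = (
    pd.crosstab(toy['Occupation'], toy['Job_Market_Difficulty'], normalize='index') * 100
).round(2)

# Group sizes, to judge how much weight each row of percentages can bear
group_size = toy['Occupation'].value_counts()

print(row_pct)
print(group_size)
```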

#### Insights from the Financial Variables

In [77]:
fig4 = make_subplots(rows=1, cols=2, subplot_titles=('Income Level Distribution',
                                                    'Total Household Income Distribution',
                                                    ))

fig4.add_trace(go.Bar(x=df['Income_Level'].value_counts().index, y=df['Income_Level'].value_counts().values, showlegend=False), row=1, col=1)

fig4.add_trace(go.Bar(x=df['Total_Household_Income'].value_counts().index, y=df['Total_Household_Income'].value_counts().values, showlegend=False), row=1, col=2)
    

fig4.update_layout(height=500, width=1000, title_text="Financial Variables Distribution") 

fig4.update_layout(
    legend_tracegroupgap=30,
    legend_title_text="Legend Groups"
)
fig4.update_yaxes(title_text="Count", row=1, col=1)
fig4.update_yaxes(title_text='Count', row=1, col=2)

fig4.update_xaxes(title_text="Income Level", row=1, col=1)
fig4.update_xaxes(title_text="Total Household Income", row=1, col=2)

fig4.show()

1. **Income Level Distribution**:
    - The majority of respondents fall under the "Middle Income" category, followed by "Low Income" and then "High Income."

2. **Total Household Income Distribution**:
    - The two most common brackets are "Under USD500" (1,515 respondents) and "Between USD501 and USD900" (1,200 respondents).
    - Very few respondents (65) have a household income of "More than USD4,500."

3. **Concern About Rising Prices**:
    - A significant portion of respondents are "Quite concerned" about rising prices.
    - A smaller group is "Neutral," and very few are "Not concerned."

4. **Borrowed Money Distribution**:
    - Many respondents have "had to borrow a few times" to meet their expenses due to rising prices.
    - A smaller group "had to borrow often," while some "have not borrowed."

5. **Ability to Meet Regular Expenses**:
    - A majority of respondents answered "Yes" to being able to meet regular expenses.
    - A smaller group answered "Maybe," and the least answered "No."
6. **Stress Level**:
    - Most people are stressed about the situation of their finances and the economy.

7. **Income Level and Concern About Rising Prices**:
    - Across all income levels, the majority of respondents are "Quite concerned" about rising prices.
    - "Neutral" and "Not concerned" responses are less frequent, especially in the "Low Income" group.

8. **Borrowed Money and Expenses Ability**:
    - Respondents who "had to borrow often" or "a few times" are more likely to answer "Maybe" or "Yes" regarding their ability to meet regular expenses.
    - Most respondents who borrow money can therefore probably meet their regular expenses, which suggests that covering those expenses may be their primary reason for borrowing.
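
The first two points describe closely related columns. Summing the household income brackets under an assumed bracket-to-level mapping reproduces the `Income_Level` counts exactly, which suggests that `Income_Level` is derived from `Total_Household_Income`. The mapping below is inferred from the counts and is not stated in the source.

```python
# Counts copied from the value_counts() output for both columns
household_income = {
    'Under USD500': 1515,
    'Between USD501 and USD900': 1200,
    'Between USD901 and USD1,800': 858,
    'Between USD1,801 and USD4,500': 296,
    'More than USD4,500': 65,
}
income_level = {'Middle Income': 2058, 'Low Income': 1515, 'High Income': 361}

# Assumed mapping, inferred from the counts (not stated in the source)
bracket_to_level = {
    'Under USD500': 'Low Income',
    'Between USD501 and USD900': 'Middle Income',
    'Between USD901 and USD1,800': 'Middle Income',
    'Between USD1,801 and USD4,500': 'High Income',
    'More than USD4,500': 'High Income',
}

# Aggregate the bracket counts into income levels
derived = {}
for bracket, count in household_income.items():
    level = bracket_to_level[bracket]
    derived[level] = derived.get(level, 0) + count

print(derived)
print('Matches Income_Level counts:', derived == income_level)
```

If the mapping holds, the two subplots show the same information at different granularity, so either one could be dropped from later analysis.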


#### Insights from Household Characteristics Distribution

In [62]:
size_income_crosstab = pd.crosstab(df['Household_Size'], df['Total_Household_Income'])
size_income_crosstab

Household_Size,"Between USD1,801 and USD4,500",Between USD501 and USD900,"Between USD901 and USD1,800","More than USD4,500",Under USD500
3 or 4 people,106,608,392,19,539
5 or 6 people,108,207,256,14,126
More than 6 people,22,19,25,11,15
Not Provided,31,119,91,7,222
Up to 2 people,29,247,94,14,613


In [70]:
fig5 = make_subplots(rows=1, cols=3, shared_yaxes=True, subplot_titles=(
    'Household Size Distribution', 
    'Children Under 18 Distribution', 
    'Household Income Distribution'
))

# Add bar plot for Household Size
fig5.add_trace(
    go.Bar(
        x=df['Household_Size'].value_counts().index, 
        y=df['Household_Size'].value_counts().values, 
        name='Household Size'
    ), 
    row=1, col=1
)

# Add bar plot for Children Under 18
fig5.add_trace(
    go.Bar(
        x=df['Children_Under_18'].value_counts().index, 
        y=df['Children_Under_18'].value_counts().values, 
        name='Children Under 18'
    ), 
    row=1, col=2
)

# Add bar plot for Household Income
fig5.add_trace(
    go.Bar(
        x=df['Total_Household_Income'].value_counts().index, 
        y=df['Total_Household_Income'].value_counts().values, 
        name='Household Income'
    ), 
    row=1, col=3
)

# Update layout
fig5.update_layout(
    height=600, 
    width=1200, 
    title_text="Household Characteristics Distribution", 
    showlegend=False
)

# Update y-axis title
fig5.update_yaxes(title_text="Count", row=1, col=1)
fig5.update_yaxes(title_text="Count", row=1, col=2)
fig5.update_yaxes(title_text="Count", row=1, col=3)

# Update x-axis titles
fig5.update_xaxes(title_text="Household Size", row=1, col=1)
fig5.update_xaxes(title_text="Children Under 18", row=1, col=2)
fig5.update_xaxes(title_text="Household Income", row=1, col=3)

# Show the plot
fig5.show()

1. **Household Size**:
    - "3 or 4 people" is the most common household size (1,664 respondents).
    - Smaller households with "Up to 2 people" are the second most common (997).
    - Larger households with "5 or 6 people" (711) or "More than 6 people" (92) are less frequent.
    - 470 responses (11.95%) are categorized as "Not Provided."

2. **Children Under 18**:
    - About two thirds of respondents (2,687) have children under 18 in their households.
    - Households without children under 18 are less common (1,247).

3. **Household Income**:
    - Most respondents fall under the "Under USD500" income category (1,515).
    - The second most common income range is "Between USD501 and USD900" (1,200).
    - Higher income categories, such as "More than USD4,500" (65), are the least represented.
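
Because `value_counts()` sorts by frequency, the income bars appear in order of count rather than in order of income, which makes the distribution harder to read. One possible fix is to reindex the counts into bracket order before plotting. The sketch below uses the counts reported earlier; `bracket_order` is an assumed ordering written out by hand.

```python
import pandas as pd

# Counts copied from the value_counts() output, in frequency order
income_counts = pd.Series({
    'Under USD500': 1515,
    'Between USD501 and USD900': 1200,
    'Between USD901 and USD1,800': 858,
    'Between USD1,801 and USD4,500': 296,
    'More than USD4,500': 65,
})

household_counts = pd.Series({
    '3 or 4 people': 1664,
    'Up to 2 people': 997,
    '5 or 6 people': 711,
    'Not Provided': 470,
    'More than 6 people': 92,
})

# Assumed natural ordering of the household size categories
size_order = [
    'Up to 2 people',
    '3 or 4 people',
    '5 or 6 people',
    'More than 6 people',
    'Not Provided',
]

# Reindex so that the bars follow household size rather than frequency
ordered_household = household_counts.reindex(size_order)

print(income_counts)
print(ordered_household)
```

The reordered series could then be passed to `go.Bar` as `x=ordered_household.index, y=ordered_household.values`. The income brackets happen to be in ascending order already, so only the household sizes change here.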