<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Bar Charts**


Estimated time needed: **30** minutes


In this lab, you will focus on visualizing data.

The dataset is provided as a CSV file.

You will load it into a `pandas` DataFrame and use `pandas` operations to extract the necessary data.


## Objectives


In this lab you will perform the following:


-   Visualize the distribution of data

-   Visualize the relationship between two features

-   Visualize the composition of data

-   Visualize comparison of data


## Setup: Working with the Dataset
**Install the needed libraries**


In [None]:
!pip install pandas
!pip install numpy

In [None]:
!pip install matplotlib
!pip install seaborn

**Download the CSV file containing survey data.**


To start, download and load the dataset into a `pandas` DataFrame.



In [None]:
# Step 1: Download the dataset
!wget -O survey-data.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv

# Step 2: Import necessary libraries and load the dataset
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# Load the data
df = pd.read_csv("survey-data.csv")

# Display the first few rows to understand the structure of the data
df.head()


### Task 1: Visualizing Data Distributions


##### 1. Histogram of `ConvertedCompYearly`


Visualize the distribution of yearly compensation (`ConvertedCompYearly`) using a histogram.



In [None]:
# Keep only values between the 1st and 99th percentiles so that extreme outliers do not distort the plot.
low = df['ConvertedCompYearly'].quantile(0.01)
high = df['ConvertedCompYearly'].quantile(0.99)
# where() keeps the original index and sets out-of-range values to NaN
in_range = df['ConvertedCompYearly'].between(low, high)
df['CompYearlyFiltered'] = df['ConvertedCompYearly'].where(in_range)
df['CompYearlyFiltered']
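
The percentile filter above can be checked on a small example. The sketch below uses hypothetical toy values (not the survey data) and assumes the default linear interpolation of `Series.quantile`; it shows that the two extreme values become `NaN` while the typical values are kept.

```python
import pandas as pd

# Hypothetical toy data: 98 typical salaries plus two extreme outliers.
comp = pd.Series(list(range(50_000, 148_000, 1_000)) + [5, 10_000_000],
                 dtype="float64")

# 1st and 99th percentiles of the toy data
low = comp.quantile(0.01)
high = comp.quantile(0.99)

# Values outside [low, high] are replaced with NaN; the index is unchanged,
# so the result can be assigned back as a new DataFrame column.
filtered = comp.where(comp.between(low, high))

print(filtered.notna().sum(), "of", len(filtered), "values kept")
```

Because the filtered Series keeps `NaN` placeholders instead of dropping rows, it stays aligned with the other columns of the DataFrame.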

In [None]:
## Write your code here
df['CompYearlyFiltered'].plot(kind = 'hist', bins = 20, edgecolor='black')
plt.title('Yearly Compensation in USD, filtered 1-99%')
plt.xlabel('Yearly Comp in USD')
plt.ylabel('Count of respondents per bin')
plt.show()

##### 2. Box Plot of `Age`


Since `Age` is categorical in the dataset, convert it to numerical values for a box plot.



In [None]:
## Write your code here
# Define a function that converts each Age category to an approximate numeric value

def Age_to_numeric(v):
    if pd.isna(v):
        return np.nan

    elif '25-34 years old' in v:
        return 30
    elif '35-44 years old' in v:
        return 40
    elif '18-24 years old' in v:
        return 22
    elif '45-54 years old' in v:
        return 50
    elif '55-64 years old' in v:
        return 60
    elif 'Under 18 years old' in v:
        return 17
    elif '65 years or older' in v:
        return 67
    elif 'Prefer not to say' in v:
        return 41  # imputed: the rounded mean of the other values
    return np.nan  # treat any unexpected category as missing

# Apply the function to the Age column and verify that it worked
df['Age_numeric'] = df['Age'].apply(Age_to_numeric)
df['Age_numeric'].describe()

In [None]:
df.boxplot(column = 'Age_numeric', figsize = (10,6))
plt.title("Distribution of Respondents' Age")
plt.ylabel('Approximate age in years')
plt.show()
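
The category-to-number conversion used for the box plot can also be sketched as a dictionary lookup with `Series.map`. The names `AGE_MIDPOINTS` and `ages` are hypothetical, and the toy values are not survey data. Unlike the function above, this variant leaves `Prefer not to say` as missing instead of imputing a value, which is an alternative design choice rather than the lab's method.

```python
import pandas as pd

# Hypothetical lookup table; the numbers mirror the approximations used above.
AGE_MIDPOINTS = {
    'Under 18 years old': 17,
    '18-24 years old': 22,
    '25-34 years old': 30,
    '35-44 years old': 40,
    '45-54 years old': 50,
    '55-64 years old': 60,
    '65 years or older': 67,
}

# Toy input with a missing value and an unmapped category
ages = pd.Series(['25-34 years old', 'Under 18 years old', None, 'Prefer not to say'])

# map() returns NaN for missing values and for categories not in the dictionary
age_numeric = ages.map(AGE_MIDPOINTS)
print(age_numeric)
```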


### Task 2: Visualizing Relationships in Data


##### 1. Scatter Plot of `Age_numeric` and `ConvertedCompYearly`


Explore the relationship between age and compensation.



In [None]:
## Write your code here
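plt.scatter(df['Age_numeric'], df['CompYearlyFiltered'], color = 'green')
plt.title('Yearly Compensation by Age (compensation filtered 1-99%)')
plt.xlabel('Approximate age in years')
plt.ylabel('Yearly Comp in USD')
plt.show()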

##### 2. Bubble Plot of `ConvertedCompYearly` and `JobSatPoints_6` with `Age_numeric` as Bubble Size


Explore how compensation and job satisfaction are related, with age as the bubble size.


In [None]:
#df['CompYearlyFiltered']
#df['JobSatPoints_6']
#df['Age_numeric']
plt.figure(figsize = (12,7))
sns.scatterplot(data = df,
               x = 'JobSatPoints_6',
               y = 'CompYearlyFiltered',
               size = 'Age_numeric', 
               alpha = 0.4)
plt.title('Yearly Compensation vs. Job Satisfaction Rating, Bubble Size by Age')
plt.xlabel('Job Satisfaction Rating')
plt.ylabel('Yearly Compensation in USD')

plt.show()


### Task 3: Visualizing Composition of Data with Bar Charts


##### 1. Horizontal Bar Chart of `MainBranch` Distribution


Visualize the distribution of respondents’ primary roles to understand their professional focus.



In [None]:
## Write your code here
df['MainBranch'].value_counts()

In [None]:
df['MainBranch'].value_counts().plot.barh()
plt.title("Distribution of Respondents' Roles")
# In a horizontal bar chart the counts run along the x-axis and the categories along the y-axis
plt.xlabel('Count')
plt.ylabel('Role')
plt.show()

##### 2. Vertical Bar Chart of Top 5 Programming Languages Respondents Want to Work With


Identify the most desired programming languages based on `LanguageWantToWorkWith`.



In [None]:
## Write your code here
# Assumption: multi-select answers are stored as semicolon-separated strings,
# so split and explode them to count individual languages rather than combinations.
languages = df['LanguageWantToWorkWith'].str.split(';').explode()
top5lng = languages.value_counts().head(5)
top5lng

In [None]:
top5lng.plot.bar()
plt.title('Top 5 Programming Languages Respondents Want to Work With')
plt.xlabel('Programming Language')
plt.ylabel('Count of Respondents')
plt.show()
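
If `LanguageWantToWorkWith` stores multi-select answers as semicolon-separated strings (an assumption about the survey export), calling `value_counts()` on the raw column counts whole combinations rather than individual languages. The sketch below uses hypothetical toy answers to show the difference between the two counts.

```python
import pandas as pd

# Hypothetical toy answers; None stands for a respondent who skipped the question.
answers = pd.Series(["Python;SQL", "Python;Rust", "SQL", None, "Python;SQL"])

# Counting the raw strings treats every combination as its own category.
combo_counts = answers.value_counts()

# Splitting on ';' and exploding yields one row per language before counting.
language_counts = answers.str.split(";").explode().value_counts()

print(combo_counts)
print(language_counts)
```

The same pattern applies to any other multi-select column stored in this format.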

##### 3. Stacked Bar Chart of Median `JobSatPoints_6` and `JobSatPoints_7` by Age Group


Compare job satisfaction metrics across different age groups with a stacked bar chart.


In [None]:
## Write your code here
medians = df.groupby('Age')[['JobSatPoints_6','JobSatPoints_7']].median()
medians

In [None]:
# Order the age groups for readability
age_order = [
    'Under 18 years old',
    '18-24 years old',
    '25-34 years old',
    '35-44 years old',
    '45-54 years old',
    '55-64 years old',
    '65 years or older',
    'Prefer not to say'
]
medians = medians.loc[age_order]

In [None]:
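# Pass figsize to the plot call: pandas creates its own figure,
# so a separate plt.figure() would only leave an empty figure behind.
medians.plot.bar(stacked = True, figsize = (12,7))
plt.title('Median Job Satisfaction Points by Age Group')
plt.ylabel('Median points (stacked)')
plt.xlabel('Age Group')
plt.show()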
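
As a minimal sketch of the ordering and sizing steps for a stacked bar chart, the example below uses hypothetical median values. It relies on `reindex`, which tolerates age groups that are absent from the data (unlike `.loc`, which raises a `KeyError` for missing labels), and passes `figsize` directly to the pandas plot call.

```python
import pandas as pd

# Hypothetical medians, deliberately out of order and missing one age group.
medians = pd.DataFrame(
    {"JobSatPoints_6": [20.0, 10.0, 15.0], "JobSatPoints_7": [5.0, 25.0, 10.0]},
    index=["35-44 years old", "18-24 years old", "25-34 years old"],
)

age_order = ["18-24 years old", "25-34 years old",
             "35-44 years old", "65 years or older"]

# reindex() inserts NaN rows for absent groups; drop them before plotting.
ordered = medians.reindex(age_order).dropna(how="all")

# pandas returns the Axes it drew on, which can be used for labels.
ax = ordered.plot.bar(stacked=True, figsize=(12, 7))
ax.set_xlabel("Age Group")
ax.set_ylabel("Median points (stacked)")
```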

##### 4. Bar Chart of Database Popularity (`DatabaseHaveWorkedWith`)


Identify the most commonly used databases among respondents by visualizing `DatabaseHaveWorkedWith`.



In [None]:
# Assumption: multi-select answers are stored as semicolon-separated strings,
# so split and explode them to count individual databases rather than combinations.
databases = df['DatabaseHaveWorkedWith'].str.split(';').explode()
top5db = databases.value_counts().head(5)
top5db

In [None]:
top5db.plot.bar()
plt.title('Most Popular Databases (Top 5)')
plt.ylabel('Count of Respondents Who Have Worked with the Database')
plt.xlabel('Database')
plt.show()

### Task 4: Visualizing Comparison of Data with Bar Charts


##### 1. Grouped Bar Chart of Median `ConvertedCompYearly` for Different Age Groups


Compare median compensation across multiple age groups with a grouped bar chart.



In [None]:
## Write your code here
median = df.groupby('Age')[['ConvertedCompYearly']].median()
median

In [None]:
# Pass figsize to the plot call so that no empty extra figure is created
median.plot.bar(figsize = (12,7))
plt.title('Median Yearly Compensation by Age Group')
plt.ylabel('Median Yearly Comp in USD')
plt.xlabel('Age Group')
plt.show()

##### 2. Bar Chart of Respondent Count by Country


Show the distribution of respondents by country to see which regions are most represented.



In [None]:
## Write your code here
top20 = df['Country'].value_counts().head(20)
top20df = df[df['Country'].isin(top20.index)]


In [None]:
top20df['Country'].value_counts()

In [None]:
# Plot the per-country counts, not the whole DataFrame
top20df['Country'].value_counts().plot.bar(figsize = (12,7))
plt.title('Count of Respondents by Country (Top 20)')
plt.ylabel('Count')
plt.xlabel('Country')
plt.show()
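
A count-by-category bar chart needs only the Series returned by `value_counts()`, which is already sorted in descending order. The sketch below uses hypothetical toy country values to show that `head(n)` on that Series gives the top categories directly, without filtering the DataFrame first.

```python
import pandas as pd

# Hypothetical toy data with distinct counts per country.
countries = pd.Series(["USA"] * 4 + ["India"] * 3 + ["Germany"] * 2 + ["Brazil"])

# value_counts() sorts by frequency, so head(3) keeps the three largest groups.
top3 = countries.value_counts().head(3)
print(top3)
```

Calling `top3.plot.bar()` on this result would draw one bar per country.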

### Final Step: Review


This lab demonstrates how to create and interpret different types of bar charts, allowing you to analyze the composition, comparison, and distribution of categorical data in the Stack Overflow dataset, including main professional branches, programming language preferences, and compensation by age group. Bar charts effectively compare counts and median values across various categories.


## Summary


After completing this lab, you will be able to:
- Create a horizontal bar chart to visualize the distribution of respondents' primary roles, helping to understand their professional focus.
- Develop a vertical bar chart to identify the most desired programming languages based on the LanguageWantToWorkWith variable.
- Use a stacked bar chart to compare job satisfaction metrics across different age groups.
- Create a bar chart to visualize the most commonly used databases among respondents using the DatabaseHaveWorkedWith variable.


## Authors:
Ayushi Jain


### Other Contributors:
- Rav Ahuja
- Lakshmi Holla
- Malika


Copyright © IBM Corporation. All rights reserved.
