<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Data Visualization**


Estimated time needed: **45** minutes


In this lab, you will focus on data visualization. The dataset will be provided through an RDBMS, and you will need to use SQL queries to extract the required data.


## Objectives


After completing this lab, you will be able to:


-   Visualize the distribution of data.

-   Visualize the relationship between two features.

-   Visualize composition and comparison of data.




## Demo: How to Work with a Database


Download the survey data file (a CSV that will be loaded into a database below).


In [1]:
!wget https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv

--2025-09-06 19:33:58--  https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv
Resolving cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)... 169.63.118.104
Connecting to cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud (cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud)|169.63.118.104|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 159525875 (152M) [text/csv]
Saving to: ‘survey-data.csv.2’


2025-09-06 19:34:02 (68.2 MB/s) - ‘survey-data.csv.2’ saved [159525875/159525875]



**Install and Import Necessary Python Libraries**

Ensure that you have the required libraries installed to work with SQLite and Pandas:


In [2]:
!pip install pandas 
!pip install matplotlib

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np



**Read the CSV File into a Pandas DataFrame**

Load the Stack Overflow survey data into a Pandas DataFrame:


In [3]:
# Read the CSV file
df = pd.read_csv('survey-data.csv')

# Display the first few rows of the data
df.head()


Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,


**Create a SQLite Database and Insert the Data**

Now, let's create a new SQLite database (`survey-data.sqlite`) and insert the data from the DataFrame into a table using the sqlite3 library:


In [None]:
import sqlite3

# Create a connection to the SQLite database
conn = sqlite3.connect('survey-data.sqlite')

# Write the dataframe to the SQLite database
df.to_sql('main', conn, if_exists='replace', index=False)


# Close the connection
conn.close()
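
As a minimal sketch of the same write-then-read round trip, the example below uses a throwaway in-memory database (`:memory:`) so nothing is written to disk. The toy DataFrame and the names `toy_df`, `toy_conn`, and `round_trip` are illustrative assumptions, not part of the lab data.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data standing in for the survey DataFrame
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "Age": ["18-24 years old", "25-34 years old", "25-34 years old"],
})

# ":memory:" creates a temporary database that disappears when the connection closes
toy_conn = sqlite3.connect(":memory:")
try:
    # index=False keeps the DataFrame index out of the table
    toy_df.to_sql("main", toy_conn, if_exists="replace", index=False)
    # Read the table back to confirm the rows and columns survived
    round_trip = pd.read_sql_query("SELECT * FROM main", toy_conn)
finally:
    toy_conn.close()

print(round_trip.shape)
print(list(round_trip.columns))
```

Passing `if_exists='replace'` drops and recreates the table on every run, which is convenient in a notebook where the cell may be executed more than once.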


**Verify the Data in the SQLite Database**

Verify that the data has been correctly inserted into the SQLite database by running a simple query:


In [None]:
# Reconnect to the SQLite database
conn = sqlite3.connect('survey-data.sqlite')

# Run a simple query to check the data
QUERY = "SELECT * FROM main LIMIT 5"
df_check = pd.read_sql_query(QUERY, conn)

# Display the results
print(df_check)


## Demo: Running an SQL Query


Count the number of rows in the table named `main`:


In [None]:
QUERY = """
SELECT COUNT(*) 
FROM main
"""
df = pd.read_sql_query(QUERY, conn)
df.head()


## Demo: Listing All Tables


To view the names of all tables in the database:


In [None]:
QUERY = """
SELECT name as Table_Name FROM sqlite_master 
WHERE type = 'table'
"""
pd.read_sql_query(QUERY, conn)


## Demo: Running a Group By Query
    
For example, you can group data by a specific column, like Age, to get the count of respondents in each age group:


In [None]:
QUERY = """
SELECT Age, COUNT(*) as count
FROM main
GROUP BY Age
ORDER BY Age
"""
pd.read_sql_query(QUERY, conn)
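
For comparison, here is a small sketch showing that the SQL `GROUP BY` count and the Pandas `value_counts()` count agree on a toy table. The data and the names `toy_df`, `sql_counts`, and `pandas_counts` are assumptions made up for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical toy data with no missing ages
toy_df = pd.DataFrame({
    "Age": [
        "25-34 years old",
        "18-24 years old",
        "25-34 years old",
        "Under 18 years old",
    ]
})

toy_conn = sqlite3.connect(":memory:")
try:
    toy_df.to_sql("main", toy_conn, index=False)
    # Count respondents per age group in SQL
    sql_counts = pd.read_sql_query(
        "SELECT Age, COUNT(*) AS count FROM main GROUP BY Age ORDER BY Age",
        toy_conn,
    )
finally:
    toy_conn.close()

# The same count in Pandas, sorted by label to match ORDER BY Age
pandas_counts = toy_df["Age"].value_counts().sort_index()

print(sql_counts)
print(pandas_counts)
```

Note that `ORDER BY Age` sorts the labels as text, so `Under 18 years old` lands after the numeric brackets rather than before them. Also, `value_counts()` skips missing values by default, whereas `GROUP BY` puts NULLs in a group of their own, so the two can differ on real data with gaps.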


## Demo: Describing a Table

Use this query to get the schema of a specific table, `main` in this case:


In [None]:
table_name = 'main'

# Bind the table name as a parameter (?) instead of formatting it into the SQL string
QUERY = """
SELECT sql FROM sqlite_master 
WHERE name = ?
"""

df = pd.read_sql_query(QUERY, conn, params=(table_name,))
print(df.iat[0,0])
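
An alternative way to inspect a table is SQLite's `PRAGMA table_info`, which returns one row per column instead of the raw `CREATE TABLE` text. The sketch below uses only the standard `sqlite3` module; the three-column toy table and the name `toy_conn` are assumptions for illustration.

```python
import sqlite3

# Hypothetical toy table standing in for the survey table
toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE main (ResponseId INTEGER, Age TEXT, CompTotal REAL)")

table_name = "main"

# The ? placeholder lets SQLite bind the value instead of pasting it into the SQL text
schema_sql = toy_conn.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
    (table_name,),
).fetchone()[0]

# Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in toy_conn.execute("PRAGMA table_info(main)")]

toy_conn.close()

print(schema_sql)
print(columns)
```

The column list is usually easier to scan than the full `CREATE TABLE` statement when a table has many columns, as the survey table does.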


## Hands-on Lab


### Visualizing the Distribution of Data

**Histograms**

Plot a histogram of CompTotal (Total Compensation).


In [None]:
## Write your code here
# Pull the CompTotal column from the database
Query='''
SELECT CompTotal
FROM main
'''
df = pd.read_sql_query(Query, conn)

df.plot(kind='hist',
        bins=60,
        figsize=(10,6)
       )
plt.title('Histogram for the Total Compensation from Dataset')
plt.xlabel('Total Compensation')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
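
If a handful of extreme values turn out to dominate the histogram, one option is to trim the data at a high percentile before plotting. The sketch below is only an illustration under that assumption; the toy values, the 95th-percentile cutoff, and the names `comp`, `upper`, and `trimmed` are all hypothetical.

```python
import pandas as pd

# Hypothetical compensation values with one extreme outlier and one missing entry
comp = pd.Series([50_000, 60_000, 75_000, 90_000, 1_000_000_000, None])

# Drop missing values, then keep only values at or below the 95th percentile
comp = comp.dropna()
upper = comp.quantile(0.95)
trimmed = comp[comp <= upper]

print(len(comp), len(trimmed))
print(trimmed.max())
```

The trimmed series can then be passed to `plot(kind='hist')` in the same way as the full column. Whether trimming is appropriate depends on the question being asked, so it is worth stating the cutoff in the chart title or caption.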

**Box Plots**

Plot a box plot of Age.


In [None]:
## Write your code here

Query = '''
SELECT Age
FROM main
'''

df_age = pd.read_sql_query(Query, conn)

# Map each age bracket to a representative numeric value so it can be plotted
age_numeric_map = {
    'Under 18 years old': 15,
    '18-24 years old': 20,
    '25-34 years old': 30,
    '35-44 years old': 40,
    '45-54 years old': 50,
    '55-64 years old': 60,
    '65 years or older': 70,
    'Prefer not to say': np.nan
}

# Applying mapping to data (swapping original column values) and cleaning missing values
df_age['Age'] = df_age['Age'].map(age_numeric_map)
cleaned_df = df_age.dropna()

cleaned_df = cleaned_df.rename(columns={'Age':''})

cleaned_df.plot(kind='box',
            figsize=(10, 6)
)
plt.title('Box Plot of Age from Dataset')
plt.xticks([])
plt.ylabel('Age')
plt.show()
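
Since the same age mapping is reused in several cells, it could be wrapped in a small helper. This is a sketch only: the function name `age_to_numeric` and the constant `AGE_MIDPOINTS` are hypothetical, and the numeric values are the same representative values used in the cell above.

```python
import numpy as np
import pandas as pd

# Representative numeric value for each age bracket (same values as in the lab cell)
AGE_MIDPOINTS = {
    "Under 18 years old": 15,
    "18-24 years old": 20,
    "25-34 years old": 30,
    "35-44 years old": 40,
    "45-54 years old": 50,
    "55-64 years old": 60,
    "65 years or older": 70,
    "Prefer not to say": np.nan,
}


def age_to_numeric(age_labels: pd.Series) -> pd.Series:
    """Map bracket labels to numbers and drop anything that could not be mapped."""
    return age_labels.map(AGE_MIDPOINTS).dropna()


# Hypothetical toy input, including a missing value and a non-answer
toy_ages = pd.Series(["18-24 years old", "Prefer not to say", "65 years or older", None])
numeric_ages = age_to_numeric(toy_ages)

print(numeric_ages.tolist())
```

Because every respondent in a bracket receives the same number, the resulting box plot has only seven possible values and should be read as a coarse approximation of the age spread.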

### Visualizing Relationships in Data

**Scatter Plots**

Create a scatter plot of Age and WorkExp.


In [None]:
## Write your code here
Query = '''
SELECT Age, WorkExp
FROM main
'''
df = pd.read_sql_query(Query, conn)

# Mapping labels for correlation
age_numeric_map = {
    'Under 18 years old': 15,
    '18-24 years old': 20,
    '25-34 years old': 30,
    '35-44 years old': 40,
    '45-54 years old': 50,
    '55-64 years old': 60,
    '65 years or older': 70,
    'Prefer not to say': np.nan
}

# Applying mapping to data (swapping original column values) and cleaning missing values
df['Age'] = df['Age'].map(age_numeric_map)
cleaned_df = df.dropna()

plt.scatter(x=cleaned_df['Age'], y=cleaned_df['WorkExp'], alpha = 0.5)
plt.title('Age vs. Work Experience')
plt.xlabel('Age')
plt.ylabel('Work Experience')
plt.show()
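
A scatter plot can be complemented with a correlation coefficient. The sketch below computes Pearson's r with `Series.corr` on a hypothetical toy frame; the values and the names `toy` and `r` are assumptions, not results from the survey.

```python
import pandas as pd

# Hypothetical toy data: representative ages and years of work experience
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50],
    "WorkExp": [1, 8, 15, 30],
})

# Series.corr defaults to the Pearson correlation coefficient
r = toy["Age"].corr(toy["WorkExp"])

print(round(r, 3))
```

On the real data, `Age` takes only a few representative values per bracket, so the coefficient is an approximation rather than an exact measure.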

**Bubble Plots**

Create a bubble plot of `TimeSearching` and `Frustration` using the Age column as the bubble size.


In [None]:
## Write your code here
query = '''
SELECT TimeSearching, Frustration, Age
FROM main
'''

df = pd.read_sql_query(query, conn)

# Mapping labels for correlation
age_numeric_map = {
    'Under 18 years old': 15,
    '18-24 years old': 20,
    '25-34 years old': 30,
    '35-44 years old': 40,
    '45-54 years old': 50,
    '55-64 years old': 60,
    '65 years or older': 70,
    'Prefer not to say': np.nan
}

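# Applying mapping to data (swapping original column values) and cleaning missing values
df['Age'] = df['Age'].map(age_numeric_map)
# .copy() gives an independent frame, which avoids SettingWithCopyWarning below
clean_df = df.dropna().copy()

# Split multi-select answers into lists so each frustration can get its own row
clean_df['Frustration'] = clean_df['Frustration'].str.split(';')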
exploded_df = clean_df.explode('Frustration')

exploded_df = exploded_df[exploded_df['Frustration'] != 'None of these']

plt.figure(figsize=(15,8))
plt.scatter(x= exploded_df['Frustration'],
            y= exploded_df['TimeSearching'],
            s= exploded_df['Age']*5,
            alpha=0.5
           )
plt.title('Frustration Type vs. Time Searching (Bubble size = Age)')
plt.xlabel('Frustration')
plt.ylabel('Time Searching')
plt.xticks(rotation=45, ha='right')
plt.grid(True)
plt.tight_layout()
plt.show()
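
The split-and-explode step is easier to follow on a tiny example. In the sketch below, the frustration labels `Issue A` and `Issue B` are placeholders rather than real survey answers, and the variable name `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one respondent picked two frustrations, one picked none
toy = pd.DataFrame({
    "TimeSearching": ["15-30 minutes a day", "30-60 minutes a day"],
    "Frustration": ["Issue A;Issue B", "None of these"],
    "Age": [20, 30],
})

# Turn the semicolon-separated string into a list, then give each item its own row
toy = toy.assign(Frustration=toy["Frustration"].str.split(";")).explode("Frustration")

# Drop the explicit non-answer and renumber the rows
toy = toy[toy["Frustration"] != "None of these"].reset_index(drop=True)

print(toy)
```

After exploding, the other columns are repeated for each item, so a respondent who selected two frustrations contributes two bubbles to the plot.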

### Visualizing Composition of Data

**Pie Charts**

Create a pie chart of the top 5 databases (`DatabaseWantToWorkWith`) that respondents wish to learn next year.


In [None]:
## Write your code here
# Setting query and pulling data from db
query='''
SELECT DatabaseWantToWorkWith
from main 
'''
df = pd.read_sql_query(query, conn)

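# Checking values available (print is needed because only a cell's last expression is displayed)
print(df['DatabaseWantToWorkWith'].value_counts())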

# Splitting multiselect options and exploding data
df['DatabaseWantToWorkWith'] = df['DatabaseWantToWorkWith'].str.split(';')
exploded_df = df['DatabaseWantToWorkWith'].explode()

# Getting top 5 of values
full_count_df = exploded_df.value_counts().head(5)

# Plotting
full_count_df.plot(kind='pie',
                   figsize=(10,6),
                   autopct='%1.1f%%',
                   startangle=90
                  )
plt.title('Top 5 Databases Respondents Want to Work With')
plt.xlabel('')
plt.ylabel('')
plt.show()
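
One thing to keep in mind when reading the pie chart: the percentages produced by `autopct` are shares of the slices that were plotted, not of all selections in the survey. The sketch below illustrates the difference on hypothetical toy answers; the names `answers`, `counts`, `top2`, `share_within_top`, and `share_of_all` are made up for the example.

```python
import pandas as pd

# Hypothetical multi-select answers, including one missing response
answers = pd.Series([
    "PostgreSQL;SQLite",
    "PostgreSQL",
    "MySQL;PostgreSQL",
    "SQLite",
    None,
])

# One row per selected database, then count how often each appears
counts = answers.str.split(";").explode().value_counts()

# Keep the two most frequent entries (the lab keeps five)
top2 = counts.head(2)

# Share among the plotted slices (what the pie chart labels show)
share_within_top = top2 / top2.sum() * 100

# Share among all selections, including the databases left out of the chart
share_of_all = top2 / counts.sum() * 100

print(share_within_top.round(1))
print(share_of_all.round(1))
```

If the chart is meant to convey overall popularity, it may help to say in the title that the percentages are relative to the top entries only.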

**Stacked Charts** 

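Create a stacked bar chart of median `TimeSearching` and `TimeAnswering` for the age group 30 to 35. Because `Age` is recorded in brackets rather than exact years, the solution below uses the closest available bracket, `25-34 years old`.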


In [None]:
## Write your code here
# Setting up query and pulling from db
query = '''
SELECT TimeSearching, TimeAnswering
FROM main
WHERE Age = '25-34 years old'
'''
df=pd.read_sql_query(query, conn)

# Checking data for missing values and full shape
print('Number of missing values in data:\n', df.isnull().sum())
print('\nShape of the data is:', df.shape)
print('\nUnique values and total count is:\n', df['TimeSearching'].value_counts())
print()
print('\nUnique values and total count is:\n', df['TimeAnswering'].value_counts())

# Relabeling data to average numerical
time_map = {
    'Less than 15 minutes a day': 7.5,
    '15-30 minutes a day': 22.5,
    '30-60 minutes a day': 45,
    '60-120 minutes a day': 90,
    'Over 120 minutes a day': 150
}
df['TimeSearching'] = df['TimeSearching'].map(time_map)
df['TimeAnswering'] = df['TimeAnswering'].map(time_map)

# Cleaning data: keep only rows where both columns are present, then take the medians
clean_data = df.dropna()
search_median = clean_data['TimeSearching'].median()
answer_median = clean_data['TimeAnswering'].median()

print(f'\nMedians for Time Searching and Time Answering are: {search_median} and {answer_median} \n')

plt.bar('25-34', search_median, label='Time Searching', color='skyblue')
plt.bar('25-34', answer_median, bottom=search_median, label='Time Answering', color='lightcoral')
plt.text('25-34', search_median / 2, f'{search_median} min', ha='center')
plt.text('25-34', search_median + answer_median / 2, f'{answer_median} min', ha='center')
plt.title('Searching vs. Answering Time for Respondents Aged 25-34')
plt.xlabel('Age Group')
plt.ylabel('Median Time (minutes per day)')
plt.legend()
plt.show()
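
How missing values are handled can change the medians. The sketch below contrasts taking each column's median on its own (missing values skipped per column) with first dropping every row that has any missing value. The toy answers and the names `toy`, `minutes`, `column_wise`, and `row_wise` are assumptions for illustration.

```python
import pandas as pd

# Same label-to-minutes mapping as in the lab cell
time_map = {
    "Less than 15 minutes a day": 7.5,
    "15-30 minutes a day": 22.5,
    "30-60 minutes a day": 45,
    "60-120 minutes a day": 90,
    "Over 120 minutes a day": 150,
}

# Hypothetical toy answers; each column has one missing value in a different row
toy = pd.DataFrame({
    "TimeSearching": ["Less than 15 minutes a day", "15-30 minutes a day",
                      "30-60 minutes a day", None],
    "TimeAnswering": [None, "15-30 minutes a day",
                      "60-120 minutes a day", "Over 120 minutes a day"],
})

# Map every column from labels to minutes
minutes = toy.apply(lambda col: col.map(time_map))

# Median per column, skipping missing values within that column only
column_wise = minutes.median()

# Median after removing every row that has a missing value in either column
row_wise = minutes.dropna().median()

print(column_wise)
print(row_wise)
```

Neither choice is wrong in itself, but the chart should state which respondents are included so that the two bars are comparable.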

### Visualizing Comparison of Data

**Line Chart**

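Plot the median `CompTotal` for all ages from 45 to 60. Because `Age` is recorded in brackets, the solution below uses the `45-54 years old` and `55-64 years old` brackets.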


In [None]:
## Write your code here
# Pulling data from db
query = '''
SELECT Age, CompTotal
FROM main
WHERE Age IN ('45-54 years old', '55-64 years old')
'''
df = pd.read_sql_query(query, conn)

# Data checking
print('Count of values available in df are:\n', df.value_counts())
print('\nTotal available data rows are:', df.shape)
print('\nTotal missing data is:', df.isnull().sum())

#Calculating medians for groups
young_group = df[df['Age'] == '45-54 years old']
young_group = young_group.dropna()
young_median = young_group['CompTotal'].median()
old_group = df[df['Age'] == '55-64 years old']
old_group = old_group.dropna()
old_median = old_group['CompTotal'].median()

# Assigning values
x_values = np.array(['45-54', '55-64'])
y_values = np.array([young_median, old_median])

# Printing and Plotting data
print('\nMedian for age group 45-54 years old is:', young_median)
print('Median for age group 55-64 years old is:', old_median)

plt.plot(x_values, y_values)
plt.title('Progression of Compensation Total by Age')
plt.xlabel('Age Group (Years)')
plt.ylabel('Total Compensation (Median)')
plt.show()
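
The per-group medians can also be computed in one step with `groupby`, which scales to any number of age brackets. This is a sketch on hypothetical toy values; the names `toy` and `medians` are assumptions.

```python
import pandas as pd

# Hypothetical toy data with one missing compensation value
toy = pd.DataFrame({
    "Age": ["45-54 years old"] * 3 + ["55-64 years old"] * 2,
    "CompTotal": [100.0, 200.0, None, 300.0, 500.0],
})

# groupby sorts the group labels; median skips missing values within each group
medians = toy.groupby("Age")["CompTotal"].median()

print(medians)
```

The resulting Series can be plotted directly, for example with `medians.plot(kind='line')`, since its index already holds the age labels.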

**Bar Chart**

Create a horizontal bar chart using the `MainBranch` column.


In [None]:
## Write your code here
# Setting up and pulling data from database
query= '''
SELECT MainBranch
FROM main
'''
df = pd.read_sql_query(query, conn)

# Checking available values
print(df.value_counts())
print('\n',df.isnull().sum())

occupation_labels = {
    'I am a developer by profession': 'Pro. Dev.',
    'I am not primarily a developer, but I write code sometimes as part of my work/studies': 'Occasional Coder',
    'I am learning to code': 'Beginner/Learner',
    'I code primarily as a hobby': 'Coding as a Hobby',
    'I used to be a developer by profession, but no longer am': 'Ex-Pro. Dev.'
}

renamed_df = df['MainBranch'].map(occupation_labels).value_counts()

renamed_df.sort_values(ascending=True).plot(kind='barh', figsize=(10,6))
plt.title("Respondents' Professional Stages of Development in Programming")
plt.xlabel('Count')
plt.ylabel('Professional Stage')
plt.show()

### Summary


In this lab, you focused on extracting and visualizing data from an RDBMS using SQL queries and SQLite. You applied various visualization techniques, including:

- Histograms to display the distribution of `CompTotal`.
- Box plots to show the spread of ages.
- Scatter plots and bubble plots to explore relationships between variables like `Age`, `WorkExp`, `TimeSearching`, and `Frustration`.
- Pie charts and stacked charts to visualize the composition of data.
- Line charts and bar charts to compare data across categories.


### Close the Database Connection

Once the lab is complete, make sure to close the database connection:


In [None]:
conn.close()

## Authors:
Ayushi Jain


### Other Contributors:
- Rav Ahuja
- Lakshmi Holla
- Malika


Copyright © IBM Corporation. All rights reserved.
