# **Lab: Exploring the Dataset**


Estimated time needed: **30** minutes


## Introduction


Data exploration is the initial phase of data analysis where we aim to understand the data's characteristics, identify patterns, and uncover potential insights. It is a crucial step that helps us make informed decisions about subsequent analysis.


## Objectives


After completing this lab, you will be able to:


-   Summarize the key characteristics of a dataset.
-   Identify different data types commonly used in data analysis.


### Install the required library


In [1]:
# import micropip

# await micropip.install('pandas')

# Import pandas after installation
import pandas as pd
print(pd.__version__)


2.2.2


## Load the dataset


<h3>Read Data</h3>
<p>
We use the <code>pandas.read_csv()</code> function to read CSV files. In the JupyterLite version of this lab, the dataset must first be downloaded into the browser environment using the code provided below.
</p>


The function below downloads the dataset into your browser:


In [2]:
# from pyodide.http import pyfetch

# async def download(url, filename):
#     response = await pyfetch(url)
#     if response.status == 200:
#         with open(filename, "wb") as f:
#             f.write(await response.bytes())

To obtain the dataset, call the download() function defined above:


In [3]:
# await download(file_path, "survey_data.csv")
# file_name="survey_data.csv"

#### To download a file in standard Python

In [4]:
import requests

def download(url, filename):
    # Fetch the file and raise an error if the HTTP request failed
    response = requests.get(url)
    response.raise_for_status()
    with open(filename, "wb") as f:
        f.write(response.content)

file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
download(file_path, "survey-data.csv")


#### Alternatively, pass the URL directly to the pandas.read_csv() function. You can uncomment and run the statement below

In [5]:
#df = pd.read_csv(file_path)

Use the pandas function read_csv() to load the data into a DataFrame.


In [6]:
df = pd.read_csv('survey-data.csv')

# Hands-on Lab


## Explore the dataset


It is a good idea to print the first 5 rows of the dataset to get a feel for what the data looks like.


Display the first 5 rows of your dataset.


In [7]:
df.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,
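The `Unnamed: 0` column in the output above usually appears when a CSV was saved with its row index included. The following is a minimal sketch, assuming that is what happened here, of how `index_col=0` would reuse that column as the index. The small in-memory CSV is made up for illustration and is not part of the survey data.

```python
import io
import pandas as pd

# Made-up CSV text that mimics a file saved with its row index included
csv_text = ",ResponseId,Age\n0,1,Under 18 years old\n1,2,35-44 years old\n"

# Default read: the unnamed first column becomes a regular column
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# With index_col=0 the same column is used as the row index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```

Whether to drop the extra column in this lab is a judgment call: keeping it changes nothing in the analysis, but it does add one to the column count.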


## Find out the number of rows and columns


Start by exploring the numbers of rows and columns of data in the dataset.


Print the number of rows in the dataset.


In [8]:
# Print the number of rows
print("Number of rows in the dataset:", df.shape[0])
print("Number of rows:", len(df))
print("Number of rows:", df.count().max())  # Largest non-null count of any column; equals the row count only if at least one column has no missing values



Number of rows in the dataset: 65437
Number of rows: 65437
Number of rows: 65437
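The three approaches agree here, but `df.count().max()` counts non-null values, so it can undercount rows when every column has gaps. A small sketch on a made-up frame (not the survey data) shows the difference:

```python
import numpy as np
import pandas as pd

# Toy frame (made up): every column has at least one missing value
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, "x", "y"]})

n_rows_shape = toy.shape[0]      # counts rows regardless of missing values
n_rows_len = len(toy)            # same as shape[0]
n_non_null = toy.count().max()   # largest non-null count among the columns

print(n_rows_shape, n_rows_len, n_non_null)
```

For that reason `df.shape[0]` or `len(df)` is the safer choice when the goal is simply the number of rows.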


Print the number of columns in the dataset.


In [9]:
# Print the number of columns
print("Number of columns in the dataset:", df.shape[1])
print("Number of columns:", len(df.columns))
print("Number of columns:", df.columns.size)

Number of columns in the dataset: 114
Number of columns: 114
Number of columns: 114


Print the number of rows and columns in the dataset.

In [10]:
df.shape

(65437, 114)
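The lab objectives also mention identifying data types. The sketch below, on a made-up three-column frame whose column names are borrowed from the survey purely for illustration, shows one way to inspect types with `dtypes` and `select_dtypes`. The same calls should work on `df`.

```python
import pandas as pd

# Toy frame (made up) with an integer, a text, and a float column
toy = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "Age": ["Under 18 years old", "35-44 years old", "18-24 years old"],
    "JobSat": [8.0, None, 6.5],
})

# One dtype per column
print(toy.dtypes)

# Split the columns into numeric and non-numeric groups
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, text_cols)
```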

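Estimate the mean age of the survey participants. The Age column stores age ranges rather than numbers, so each range is first mapped to a representative value.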


In [14]:
# Create a list of the unique values in the Age column
unique_ages = df['Age'].unique().tolist()  # The .tolist() converts the result to a Python list

In [19]:
print(unique_ages)

['Under 18 years old', '35-44 years old', '45-54 years old', '18-24 years old', '25-34 years old', '55-64 years old', 'Prefer not to say', '65 years or older']


Find the midpoint of each age range: (start + end) / 2

- Under 18: 15 (the literal midpoint between 0 and 18 is 9, which is unlikely to be typical for respondents of a developer survey, so we use an estimate of 15 instead)
- 18-24: 21
- 25-34: 29.5
- 35-44: 39.5
- 45-54: 49.5
- 55-64: 59.5
- Prefer not to say: np.nan
- 65 years or older: 70
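The midpoint rule above can be sketched as a small helper. The function name `range_midpoint` and the regular expression are illustrative assumptions, not part of the lab. Open-ended labels such as "Under 18 years old" have no two bounds to average, so the helper returns `None` for them and a hand-picked estimate is needed.

```python
import re

def range_midpoint(label):
    """Return (start + end) / 2 for labels like '25-34 years old', else None."""
    match = re.match(r"(\d+)-(\d+)", label)
    if match is None:
        return None  # open-ended or non-numeric label: needs a manual estimate
    start, end = int(match.group(1)), int(match.group(2))
    return (start + end) / 2

for label in ["18-24 years old", "25-34 years old", "Under 18 years old"]:
    print(label, "->", range_midpoint(label))
```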


In [16]:
# Define midpoints for the age ranges. Open-ended ranges (under 18, 65 or older) use hand-picked estimates
import numpy as np

midpoints = {
    'Under 18 years old': 15,
    '18-24 years old': 21,
    '25-34 years old': 29.5,
    '35-44 years old': 39.5,
    '45-54 years old': 49.5,
    '55-64 years old': 59.5,
    '65 years or older': 70,
    'Prefer not to say': np.nan
}

In [17]:
# Map age ranges to midpoints. 'Prefer not to say' maps to NaN and is then filled with 0,
# which pulls the mean down; remove .fillna(0) to let .mean() skip those rows instead
df['Midpoint'] = df['Age'].map(midpoints).fillna(0)

In [18]:
estimated_mean_age = df['Midpoint'].mean()

print(f"The estimated mean age is: {estimated_mean_age}")

The estimated mean age is: 32.80620291272522
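How missing values are treated affects this estimate. By default, `Series.mean()` skips NaN, whereas filling NaN with 0 counts those respondents as age 0. A minimal sketch on three made-up responses (not the survey data) illustrates the gap:

```python
import numpy as np
import pandas as pd

# Three made-up responses, one of which declines to give an age
ages = pd.Series(["18-24 years old", "25-34 years old", "Prefer not to say"])
toy_midpoints = {
    "18-24 years old": 21,
    "25-34 years old": 29.5,
    "Prefer not to say": np.nan,
}

mapped = ages.map(toy_midpoints)

# mean() ignores NaN by default: (21 + 29.5) / 2
mean_skipping_nan = mapped.mean()

# Filling with 0 treats the missing answer as age 0: (21 + 29.5 + 0) / 3
mean_with_zero_fill = mapped.fillna(0).mean()

print(mean_skipping_nan, mean_with_zero_fill)
```

If the goal is the mean over respondents who actually reported an age, leaving the NaN values in place is likely the better choice.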


In [29]:
# Create a list of the unique values in the Country column (unique() also returns NaN if any entries are missing)
unique_countries = df['Country'].unique().tolist()  # The .tolist() converts the result to a Python list
# Print the number of unique countries
print("Number of unique countries:", len(unique_countries))

Number of unique countries: 186
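One caveat when counting this way: `unique()` treats NaN as a value, while `nunique()` excludes it by default. If the Country column contains missing entries, the count above would be one higher than the number of actual countries. A short sketch on made-up data shows the difference:

```python
import numpy as np
import pandas as pd

# Made-up column with two distinct countries and one missing entry
countries = pd.Series(["India", "Germany", np.nan, "India"])

n_with_nan = len(countries.unique().tolist())  # NaN counted as a value
n_without_nan = countries.nunique()            # NaN excluded by default

print(n_with_nan, n_without_nan)
```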
