<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Lab: Exploring the Dataset**


Estimated time needed: **30** minutes


## Introduction


Data exploration is the initial phase of data analysis where we aim to understand the data's characteristics, identify patterns, and uncover potential insights. It is a crucial step that helps us make informed decisions about subsequent analysis.


## Objectives


After completing this lab, you will be able to:


-   Summarize the key characteristics of a dataset.
-   Identify different data types commonly used in data analysis.


### Install the required library


In [1]:
import micropip

await micropip.install('pandas')

# Import pandas after installation
import pandas as pd
print(pd.__version__)



2.2.0


## Load the dataset


<h3>Read Data</h3>
<p>
The <code>pandas.read_csv()</code> function reads CSV files into a DataFrame. Because this version of the lab runs on JupyterLite, the dataset must first be downloaded into the browser environment using the code below.
</p>


The function below downloads the dataset into your browser environment:


In [2]:
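from pyodide.http import pyfetch

async def download(url, filename):
    # Fetch the file over HTTP from inside the browser (Pyodide)
    response = await pyfetch(url)
    if response.status == 200:
        # Save the raw bytes to JupyterLite's in-browser file system
        with open(filename, "wb") as f:
            f.write(await response.bytes())
    else:
        # Fail loudly instead of silently leaving the file missing
        raise RuntimeError(f"Download failed with HTTP status {response.status}")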

In [3]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

To obtain the dataset, call the `download()` function defined above:


In [4]:
await download(file_path, "survey_data.csv")
file_name="survey_data.csv"

Use the pandas function `read_csv()` to load the data into a DataFrame.


In [5]:
df = pd.read_csv(file_name)

> Note: This version of the lab runs on JupyterLite, which requires the dataset to be downloaded into the browser environment first. If you are working on a downloaded copy of this notebook on a local machine (for example, Jupyter via Anaconda), you can **skip the steps above** and pass the URL directly to the `pandas.read_csv()` function.
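For a local Jupyter installation, the note above amounts to a single call. The following is a minimal sketch, assuming the machine has network access. `load_survey` and `SURVEY_URL` are hypothetical names introduced here for illustration and are not part of the lab; the in-memory CSV uses made-up rows so the sketch also runs offline.

```python
import io
import pandas as pd

# URL copied from the lab's `file_path` variable
SURVEY_URL = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

def load_survey(source=SURVEY_URL):
    """Hypothetical helper: read the survey CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)

# On a local machine with network access (not run here):
# df = load_survey()

# Offline illustration with a tiny in-memory CSV (made-up rows, not survey data)
sample = io.StringIO("ResponseId,Age\n1,Under 18 years old\n2,35-44 years old\n")
demo_df = load_survey(sample)
print(demo_df.shape)
```

`pandas.read_csv()` accepts a URL, a file path, or a file-like object, which is why the same helper serves both cases.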


# Hands-on Lab


## Explore the dataset


Printing the first 5 rows is a quick way to get a feel for what the dataset looks like.


Display the first 5 rows of your dataset, along with their columns.


In [6]:
## Write your code here
df.head()

|   | ResponseId | MainBranch | Age | Employment | RemoteWork | Check | CodingActivities | EdLevel | LearnCode | LearnCodeOnline | ... | JobSatPoints_6 | JobSatPoints_7 | JobSatPoints_8 | JobSatPoints_9 | JobSatPoints_10 | JobSatPoints_11 | SurveyLength | SurveyEase | ConvertedCompYearly | JobSat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I am a developer by profession | Under 18 years old | Employed, full-time | Remote | Apples | Hobby | Primary/elementary school | Books / Physical media | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2 | I am a developer by profession | 35-44 years old | Employed, full-time | Remote | Apples | Hobby;Contribute to open-source projects;Other... | Bachelor’s degree (B.A., B.S., B.Eng., etc.) | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
| 2 | 3 | I am a developer by profession | 45-54 years old | Employed, full-time | Remote | Apples | Hobby;Contribute to open-source projects;Other... | Master’s degree (M.A., M.S., M.Eng., MBA, etc.) | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | ... | NaN | NaN | NaN | NaN | NaN | NaN | Appropriate in length | Easy | NaN | NaN |
| 3 | 4 | I am learning to code | 18-24 years old | Student, full-time | NaN | Apples | NaN | Some college/university study without earning ... | Other online resources (e.g., videos, blogs, f... | Stack Overflow;How-to videos;Interactive tutorial | ... | NaN | NaN | NaN | NaN | NaN | NaN | Too long | Easy | NaN | NaN |
| 4 | 5 | I am a developer by profession | 18-24 years old | Student, full-time | NaN | Apples | NaN | Secondary school (e.g. American high school, G... | Other online resources (e.g., videos, blogs, f... | Technical documentation;Blogs;Written Tutorial... | ... | NaN | NaN | NaN | NaN | NaN | NaN | Too short | Easy | NaN | NaN |
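The `...` column in the preview appears because pandas elides the middle columns of wide frames. The sketch below shows one way to see every column, using a toy frame with made-up values (not the survey data); `wide_df` and the other variable names are illustrative assumptions.

```python
import pandas as pd

# Toy frame with 30 columns (made-up values, not the survey data)
wide_df = pd.DataFrame({f"col_{i}": range(8) for i in range(30)})

# head(n) returns the first n rows; the default is n=5
first_three = wide_df.head(3)
print(first_three.shape)

# Temporarily lift the column-display limit so no columns are elided as "..."
with pd.option_context("display.max_columns", None):
    full_text = repr(first_three)

print(full_text)
```

Using `pd.option_context` keeps the change local to the `with` block, so the notebook's default display settings are untouched afterwards.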


## Find out the number of rows and columns


Start by finding the number of rows and columns in the dataset.


Print the number of rows in the dataset.


In [7]:
## Write your code here
row=df.shape[0]
row

65437

Print the number of columns in the dataset.


In [9]:
## Write your code here
col=df.shape[1]
col


114
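Both numbers come from the same `shape` tuple, so they can also be read in one step. A small sketch on a toy frame (made-up values; `toy_df`, `n_rows`, and `n_cols` are illustrative names):

```python
import pandas as pd

# Toy frame (made-up values, not the survey data)
toy_df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy_df.shape
print(f"{n_rows} rows x {n_cols} columns")

# len() on a DataFrame gives the number of rows; len() on .columns gives the number of columns
row_count = len(toy_df)
col_count = len(toy_df.columns)
```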

## Identify the data types of each column


Explore the dataset and identify the data types of each column.


Print the data type of every column.


In [10]:
## Write your code here
df.dtypes

ResponseId               int64
MainBranch              object
Age                     object
Employment              object
RemoteWork              object
                        ...   
JobSatPoints_11        float64
SurveyLength            object
SurveyEase              object
ConvertedCompYearly    float64
JobSat                 float64
Length: 114, dtype: object

In [13]:
column_data_types = [(col, df[col].dtype) for col in df.columns]
print(sorted(column_data_types, key=lambda x: x[0]))

[('AIAcc', dtype('O')), ('AIBen', dtype('O')), ('AIChallenges', dtype('O')), ('AIComplex', dtype('O')), ('AIEthics', dtype('O')), ('AINextLess integrated', dtype('O')), ('AINextMore integrated', dtype('O')), ('AINextMuch less integrated', dtype('O')), ('AINextMuch more integrated', dtype('O')), ('AINextNo change', dtype('O')), ('AISearchDevAdmired', dtype('O')), ('AISearchDevHaveWorkedWith', dtype('O')), ('AISearchDevWantToWorkWith', dtype('O')), ('AISelect', dtype('O')), ('AISent', dtype('O')), ('AIThreat', dtype('O')), ('AIToolCurrently Using', dtype('O')), ('AIToolInterested in Using', dtype('O')), ('AIToolNot interested in Using', dtype('O')), ('Age', dtype('O')), ('BuildvsBuy', dtype('O')), ('BuyNewTool', dtype('O')), ('Check', dtype('O')), ('CodingActivities', dtype('O')), ('CompTotal', dtype('float64')), ('ConvertedCompYearly', dtype('float64')), ('Country', dtype('O')), ('Currency', dtype('O')), ('DatabaseAdmired', dtype('O')), ('DatabaseHaveWorkedWith', dtype('O')), ('Database
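With more than a hundred columns, a per-column listing is hard to scan. One possible way to summarize it is to count columns per data type and to split numeric from non-numeric columns. The sketch below uses a toy frame with made-up values; the column names mirror the survey only for readability.

```python
import pandas as pd

# Toy frame (made-up values, not the survey data)
toy_df = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "Age": ["18-24 years old", "25-34 years old", "35-44 years old"],
    "JobSat": [7.0, None, 9.0],
})

# How many columns share each dtype
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)

# Column names grouped by kind of data
numeric_cols = toy_df.select_dtypes(include="number").columns.tolist()
text_cols = toy_df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(text_cols)
```

Knowing which columns are non-numeric matters for the next task: a text column such as `Age` cannot be averaged until it is converted to numbers.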

Print the mean age of the survey participants.


In [19]:
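import re

def extract_age_range(range_str):
    """Return (min_age, max_age) parsed from an age label such as '18-24 years old'."""
    # Extract every run of digits from the label
    numbers = re.findall(r'\d+', range_str)

    if len(numbers) == 2:  # e.g. "18-24 years old"
        min_age, max_age = map(int, numbers)
        return (min_age, max_age)
    elif len(numbers) == 1:
        if range_str.startswith('Under'):  # e.g. "Under 18 years old": upper bound only
            return (0, int(numbers[0]) - 1)  # Lower bound of 0 is an assumption
        return (int(numbers[0]), float('inf'))  # e.g. "65 years or older": no upper limit
    else:
        return (None, None)  # e.g. "Prefer not to say"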

In [34]:
distinct_age = df['Age'].drop_duplicates().sort_values()
distinct_age

3        18-24 years old
14       25-34 years old
1        35-44 years old
2        45-54 years old
23       55-64 years old
48     65 years or older
30     Prefer not to say
0     Under 18 years old
Name: Age, dtype: object

In [54]:
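# Approximate each age bracket with a single representative age.
# Closed brackets use the midpoint; the open-ended brackets
# (under 18, 65 or older) use assumed values.
age_midpoints = {
    'Under 18 years old': 17,
    '18-24 years old': 21,
    '25-34 years old': 29.5,
    '35-44 years old': 39.5,
    '45-54 years old': 49.5,
    '55-64 years old': 59.5,
    '65 years or older': 65,
    'Prefer not to say': None  # Excluded from the mean calculation
}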

In [56]:
## Write your code here
# Age is stored as text brackets, so it cannot be averaged directly;
# map each bracket to its approximate midpoint first.
df['Age_Midpoint'] = df['Age'].map(age_midpoints)

# Drop respondents with no usable age (e.g. 'Prefer not to say')
df_valid_ages = df['Age_Midpoint'].dropna()

sur_mean_age = df_valid_ages.mean()
print(f"The mean age of the survey participants is : {sur_mean_age}")

The mean age of the survey participants is : 32.9880288719957
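Because every respondent in a bracket receives the same midpoint, this mean is a weighted average of the midpoints, with the bracket counts as weights. The sketch below checks that equivalence on a made-up sample (not the survey data). The midpoint values repeat the lab's assumptions, and the variable names are illustrative.

```python
import pandas as pd

# Made-up sample of age labels (not the survey data)
ages = pd.Series([
    "18-24 years old", "25-34 years old", "25-34 years old",
    "35-44 years old", "Prefer not to say",
])

# Same midpoint assumptions as the lab, restricted to the labels used here
midpoints = {"18-24 years old": 21, "25-34 years old": 29.5, "35-44 years old": 39.5}

# Route 1: map every row, then average (unmapped labels become NaN and are skipped)
row_mean = ages.map(midpoints).mean()

# Route 2: weight each midpoint by the number of respondents in that bracket
counts = ages.value_counts()
weighted_sum = sum(midpoints[label] * n for label, n in counts.items() if label in midpoints)
n_valid = sum(n for label, n in counts.items() if label in midpoints)
weighted_mean = weighted_sum / n_valid

print(row_mean, weighted_mean)
```

The result is only an estimate: it depends on the values assumed for each bracket, especially the open-ended ones.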


The dataset is the result of a worldwide survey. Print the number of unique countries in the Country column.


In [46]:
df['Country'].unique()

array(['United States of America',
       'United Kingdom of Great Britain and Northern Ireland', 'Canada',
       'Norway', 'Uzbekistan', 'Serbia', 'Poland', 'Philippines',
       'Bulgaria', 'Switzerland', 'India', 'Germany', 'Ireland', 'Italy',
       'Ukraine', 'Australia', 'Brazil', 'Japan', 'Austria',
       'Iran, Islamic Republic of...', 'France', 'Saudi Arabia',
       'Romania', 'Turkey', 'Nepal', 'Algeria', 'Sweden', 'Netherlands',
       'Croatia', 'Pakistan', 'Czech Republic',
       'Republic of North Macedonia', 'Finland', 'Slovakia',
       'Russian Federation', 'Greece', 'Israel', 'Belgium', 'Mexico',
       'United Republic of Tanzania', 'Hungary', 'Argentina', 'Portugal',
       'Sri Lanka', 'Latvia', 'China', 'Singapore', 'Lebanon', 'Spain',
       'South Africa', 'Lithuania', 'Viet Nam', 'Dominican Republic',
       'Indonesia', 'Kosovo', 'Morocco', 'Taiwan', 'Georgia',
       'San Marino', 'Tunisia', 'Bangladesh', 'Nigeria', 'Liechtenstein',
       'Denmark', 'Ecu

In [48]:
df['Country'].nunique()

185
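Missing values affect this count, which is why they are excluded before counting. The sketch below contrasts the options on a made-up series (not the survey data):

```python
import pandas as pd

# Made-up sample with one missing value (not the survey data)
countries = pd.Series(["Canada", "India", None, "Canada", "Norway"])

# unique() keeps the missing value as its own entry
n_with_missing = len(countries.unique())

# nunique() ignores missing values by default
n_distinct = countries.nunique()

# dropna=False makes nunique() count the missing value as well
n_distinct_with_missing = countries.nunique(dropna=False)

print(n_with_missing, n_distinct, n_distinct_with_missing)
```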

In [45]:
## Write your code here
# Count respondents per country, sorted from most to fewest
country_counts = df['Country'].value_counts()
print(country_counts)

# The same counts as a list of (country, count) tuples
country_count_pairs = list(country_counts.items())
print(country_count_pairs)

Country
United States of America                                11095
Germany                                                  4947
India                                                    4231
United Kingdom of Great Britain and Northern Ireland     3224
Ukraine                                                  2672
                                                        ...  
Central African Republic                                    1
Equatorial Guinea                                           1
Niger                                                       1
Guinea                                                      1
Solomon Islands                                             1
Name: count, Length: 185, dtype: int64
[('United States of America', 11095), ('Germany', 4947), ('India', 4231), ('United Kingdom of Great Britain and Northern Ireland', 3224), ('Ukraine', 2672), ('France', 2110), ('Canada', 2104), ('Poland', 1534), ('Netherlands', 1449), ('Brazil', 1375), ('Italy', 1341), ('
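Raw counts are easier to compare as shares of the total. A minimal sketch on a made-up series (not the survey data), showing `value_counts(normalize=True)` and a top-N selection; the variable names are illustrative:

```python
import pandas as pd

# Made-up sample with one missing value (not the survey data)
countries = pd.Series(["Germany", "India", "Germany", "Canada", "Germany", "India", None])

# Counts are sorted in descending order; missing values are excluded by default
counts = countries.value_counts()

# normalize=True returns each country's share of the non-missing responses
shares = countries.value_counts(normalize=True)

# Because the result is already sorted, head(n) gives the n most common countries
top_two = counts.head(2)

print(shares)
print(top_two)
```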

Copyright ©  IBM Corporation. All rights reserved.
