# **Exploring the Dataset**


## Objectives


-   Read the dataset into a dataframe
-   Explore some of the data using pandas


## Load the dataset


Import the required libraries.


In [33]:
import pandas as pd

The dataset is hosted on IBM Cloud at the URL below.


In [34]:
dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"

Load the data into a dataframe.


In [35]:
data = pd.read_csv(dataset_url)
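
When only a quick look at a large CSV is needed, `pd.read_csv` can load a subset of columns and rows through its `usecols` and `nrows` parameters. The sketch below uses a small in-memory CSV as a stand-in for the survey file; the three sample rows are hypothetical and are not taken from the dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the survey CSV
sample_csv = io.StringIO(
    "ResponseId,Age,Country\n"
    "1,25-34 years old,India\n"
    "2,35-44 years old,Germany\n"
    "3,18-24 years old,Brazil\n"
)

# Load only two of the columns and only the first two rows
preview = pd.read_csv(sample_csv, usecols=["ResponseId", "Age"], nrows=2)
print(preview.shape)
```

The same two parameters should work when a URL such as `dataset_url` is passed instead of a file-like object.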

## Explore the dataset


Display the first 5 rows of the dataset. With this many columns, pandas elides the middle ones and shows `...` in their place.


In [36]:
data.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,
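
The `...` column in the output above marks columns that pandas left out of the display. One way to see every column is to lift the `display.max_columns` limit temporarily with `pd.option_context`. The sketch below uses a hypothetical 30-column frame named `wide` in place of the survey data.

```python
import pandas as pd

# Hypothetical single-row frame with 30 columns, standing in for the wide survey data
wide = pd.DataFrame([[0] * 30], columns=[f"col_{i}" for i in range(30)])

# Temporarily remove the column limit; the previous setting is restored on exit
with pd.option_context("display.max_columns", None):
    full_repr = repr(wide.head())

print(full_repr)
```

Using a context manager rather than `pd.set_option` keeps the change local, so later cells still get the compact display.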


## Finding out the number of rows and columns


The number of rows in the dataset.


In [37]:
print(f'There are {data.shape[0]} rows')

There are 65437 rows


The number of columns in the dataset.


In [38]:
print(f'There are {data.shape[1]} columns')

There are 114 columns


## Identifying the data types of each column


In [39]:
# Option to display all the rows in the dataframe
pd.set_option('display.max_rows', None)

Display the data type of each column.

In [40]:
data.dtypes

ResponseId                          int64
MainBranch                         object
Age                                object
Employment                         object
RemoteWork                         object
Check                              object
CodingActivities                   object
EdLevel                            object
LearnCode                          object
LearnCodeOnline                    object
TechDoc                            object
YearsCode                          object
YearsCodePro                       object
DevType                            object
OrgSize                            object
PurchaseInfluence                  object
BuyNewTool                         object
BuildvsBuy                         object
TechEndorse                        object
Country                            object
Currency                           object
CompTotal                         float64
LanguageHaveWorkedWith             object
LanguageWantToWorkWith            
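
Most columns in the listing above are `object`, including ones such as `YearsCode` whose name suggests a number. A column is typically read as `object` when at least one value is not numeric. The sketch below shows how to list the numeric columns and how `pd.to_numeric` can convert a text column; the sample frame and the `"Less than 1 year"` value are assumptions made for illustration, not values confirmed from the dataset.

```python
import pandas as pd

# Hypothetical sample; "Less than 1 year" is an assumed example of a non-numeric answer
sample = pd.DataFrame({
    "ResponseId": [1, 2, 3],
    "YearsCode": ["10", "Less than 1 year", "4"],
    "CompTotal": [50000.0, None, 72000.0],
})

# Columns that pandas already treats as numeric
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)

# Convert a text column to numbers; entries that cannot be parsed become NaN
years = pd.to_numeric(sample["YearsCode"], errors="coerce")
print(years)
```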

## Finding the most frequent age range in the survey


In [41]:
data['Age'].describe()


count               65437
unique                  8
top       25-34 years old
freq                23911
Name: Age, dtype: object

In [42]:
data['Age'].value_counts()

Age
25-34 years old       23911
35-44 years old       14942
18-24 years old       14098
45-54 years old        6249
55-64 years old        2575
Under 18 years old     2568
65 years or older       772
Prefer not to say       322
Name: count, dtype: int64

Based on the `describe()` and `value_counts()` output, the most common age range among survey participants is `25-34 years old`, with 23,911 of 65,437 responses (about 37%).
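
To put the counts in proportion, each age range can be expressed as a share of all responses. The sketch below recomputes the shares in plain Python from the counts printed by `value_counts()` above; in pandas, `data['Age'].value_counts(normalize=True)` should give the same proportions directly.

```python
# Counts copied from the value_counts() output above
age_counts = {
    "25-34 years old": 23911,
    "35-44 years old": 14942,
    "18-24 years old": 14098,
    "45-54 years old": 6249,
    "55-64 years old": 2575,
    "Under 18 years old": 2568,
    "65 years or older": 772,
    "Prefer not to say": 322,
}

total = sum(age_counts.values())

# Percentage of all responses in each age range, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in age_counts.items()}

# The age range with the highest count
top = max(age_counts, key=age_counts.get)
print(f"{top}: {shares[top]}% of {total} responses")
```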

## Finding the average age 

In [43]:
# Define a function to apply to each value in the 'Age' column
def avg_age(age_range):
    """Return the midpoint of a range such as '25-34 years old', else None."""

    # Split the string on the hyphen
    parts = age_range.split('-')

    # Only strings of the form 'low-high years old' have a midpoint
    if len(parts) == 2:

        # Strip the 'years old' suffix
        parts[1] = parts[1].replace("years old", "")

        # Retrieve the low and high ends of the range
        low = int(parts[0])
        high = int(parts[1])

        # Return the midpoint of the range
        return (low + high) / 2

    # Open-ended or non-numeric categories have no midpoint
    return None

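Apply the function and calculate the mean. Categories without a hyphen (`Under 18 years old`, `65 years or older`, `Prefer not to say`) return `None` and are skipped by `mean()`, so the result is an estimate based on range midpoints only.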

In [44]:
average_age = data['Age'].apply(avg_age).mean()
age = float(round(average_age, 1))
print(f'The average age is around {age} years old.')

The average age is around 33.3 years old.
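
As a cross-check, the same estimate can be reproduced by hand from the `value_counts()` output: weight each range's midpoint by its count and divide by the number of responses that fall in a closed range. This is a minimal sketch in plain Python, using the counts shown earlier.

```python
# (low, high) bounds and counts for the five closed age ranges shown earlier
range_counts = {
    (25, 34): 23911,
    (35, 44): 14942,
    (18, 24): 14098,
    (45, 54): 6249,
    (55, 64): 2575,
}

# Sum of midpoint * count over all ranges
weighted_sum = sum((low + high) / 2 * n for (low, high), n in range_counts.items())

# Number of responses that contribute to the estimate
included = sum(range_counts.values())

estimate = weighted_sum / included
print(f"{included} responses included, estimated mean age {round(estimate, 1)}")
```

The remaining 3,662 responses (under 18, 65 or older, and prefer not to say) are left out, so the figure describes only respondents in the closed ranges.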



## Finding the number of unique countries 

In [45]:
unique_countries = data['Country'].unique()
print(f'There are {len(unique_countries)} unique countries')

There are 186 unique countries
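
One caveat: `unique()` keeps missing values as a distinct entry, while `nunique()` ignores them by default. Whether `Country` contains missing values is not shown in this section, so the sketch below uses a small hypothetical series to illustrate the difference.

```python
import pandas as pd

# Hypothetical sample with one missing value
countries = pd.Series(["India", "Germany", None, "India"])

# unique() returns the missing value as its own entry
with_missing = len(countries.unique())

# nunique() drops missing values unless dropna=False is passed
without_missing = countries.nunique()

print(with_missing, without_missing)
```

If the survey column does contain missing values, `data['Country'].nunique()` would report one fewer than `len(data['Country'].unique())`.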


<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-10-29|3.0|Madhusudhan|Updated lab|
|2024-09-23|2.0|Madhusudhan|Updated lab|
|2024-09-22|1.0|Raghul Ramesh|Created lab|
-->
