
<h1>Python Data Structures: The Building Blocks of Data Science</h1>
<p>In this notebook, we will dive deep into the fundamental data structures in Python: <b>Lists</b>, <b>Dictionaries</b>, <b>Sets</b>, and <b>Tuples</b>.</p>
<p>As a future Data Scientist, you won't just use these to store "apples" and "bananas." You will use them to manage datasets, configure machine learning models, and handle API responses. We will explore these with modern, relevant examples.</p>

<h2>1. Boolean Variables &amp; Logical Operators</h2>
<p>Booleans represent one of two values: <code>True</code> or <code>False</code>. In data analysis, these are crucial for filtering data (e.g., "Select rows where <code>is_active</code> is True").</p>

In [2]:
# Booleans can be assigned directly; they are also the result of comparisons
is_logged_in = True
has_premium_access = False

print(f"Is the user logged in? {is_logged_in}")
print(f"Type of variable: {type(is_logged_in)}")

# The built-in bool() function reports the "truthiness" of a value
# 0, None, and empty containers (lists, strings, dicts, ...) are considered False. Most other values are True.
print(bool(0))          # False
print(bool(1))          # True
print(bool([]))         # False (Empty list)
print(bool("Data"))     # True

Is the user logged in? True
Type of variable: <class 'bool'>
False
True
False
True
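
<p>As a small sketch of how truthiness is used in practice, the cell below checks the common "falsy" values and then filters missing entries out of a list. The <code>raw_names</code> data is made up for illustration.</p>

```python
# Values that Python treats as False in a boolean context ("falsy" values)
falsy_values = [0, 0.0, "", [], {}, set(), (), None, False]
all_falsy = all(not bool(value) for value in falsy_values)
print(f"Every value above is falsy: {all_falsy}")

# A non-empty container is truthy even when its contents are falsy
print(bool([0]))        # True (the list has one element)
print(bool("False"))    # True (a non-empty string)

# Practical use: drop empty or missing fields while cleaning records
raw_names = ["Alice", "", None, "Bob"]  # hypothetical sample data
clean_names = [name for name in raw_names if name]
print(f"Clean names: {clean_names}")
```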


<h3>String Methods returning Booleans</h3>
<p>When cleaning data, you often need to check the format of a string (e.g., is this ID a number? Is this name all uppercase?).</p>

In [3]:
raw_user_id = "User123"
dirty_price = "1500"
title_text = "Machine Learning Guide"

print(f"Is '{raw_user_id}' alphanumeric? {raw_user_id.isalnum()}") # Useful for ID validation
print(f"Is '{dirty_price}' a digit? {dirty_price.isdigit()}")       # Useful for cleaning numerical columns
print(f"Is '{title_text}' a title? {title_text.istitle()}")         # Useful for NLP tasks

# Logical operators (and, or, not)
# Scenario: A user can access a dashboard if they are logged in OR have an admin key.
admin_key = False
access_granted = is_logged_in or admin_key

print(f"Access Granted: {access_granted}")

Is 'User123' alphanumeric? True
Is '1500' a digit? True
Is 'Machine Learning Guide' a title? True
Access Granted: True
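
<p>The cell above only exercises <code>or</code>. A minimal sketch of the other two operators, reusing the same flags (redefined here so the cell runs on its own); <code>can_view_premium</code>, <code>is_guest</code>, and <code>can_export</code> are illustrative names:</p>

```python
is_logged_in = True
has_premium_access = False
admin_key = False

# and: True only when BOTH sides are True
can_view_premium = is_logged_in and has_premium_access
print(f"Can view premium content: {can_view_premium}")

# not: flips a boolean
is_guest = not is_logged_in
print(f"Is guest: {is_guest}")

# Parentheses make combined conditions explicit
can_export = is_logged_in and (has_premium_access or admin_key)
print(f"Can export data: {can_export}")
```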


<h2>2. Lists</h2>
<p>Lists are mutable (changeable), ordered sequences. In Data Science, think of a list as a column in a dataset or a collection of file paths to process.</p>

In [4]:
# Creating a list of server latency readings (in ms)
latency_readings = [120, 115, 90, 150, 200]
print(f"Original Readings: {latency_readings}")

# Lists can hold mixed data types, though in data science we usually keep them uniform
mixed_data = ["Server_A", 200, True]


Original Readings: [120, 115, 90, 150, 200]


<h3>Appending and Inserting</h3>
<p>You will often start with an empty list and fill it as you loop through data.</p>

In [5]:
# .append() adds an item to the end
latency_readings.append(95)
print(f"After appending 95: {latency_readings}")

# .insert() adds an item at a specific index
# Inserting a high priority reading at the start
latency_readings.insert(0, 300)
print(f"After inserting 300 at index 0: {latency_readings}")

# .extend() appends every item from another list to this one (in place)
new_batch = [110, 105]
latency_readings.extend(new_batch)
print(f"After extending with {new_batch}: {latency_readings}")

After appending 95: [120, 115, 90, 150, 200, 95]
After inserting 300 at index 0: [300, 120, 115, 90, 150, 200, 95]
After extending with [110, 105]: [300, 120, 115, 90, 150, 200, 95, 110, 105]
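
<p>A common slip is calling <code>.append()</code> where <code>.extend()</code> was intended. The sketch below contrasts the two on small throwaway lists (<code>readings_a</code> and <code>readings_b</code> are illustrative names):</p>

```python
readings_a = [120, 115]
readings_b = [120, 115]
new_batch = [110, 105]

# .append() adds the whole list as ONE nested element
readings_a.append(new_batch)
print(f"append: {readings_a} (length {len(readings_a)})")

# .extend() adds each item of the list individually
readings_b.extend(new_batch)
print(f"extend: {readings_b} (length {len(readings_b)})")
```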


<h3>Indexing and Slicing (Crucial for Data Manipulation)</h3>
<p>Indexing retrieves a single element, and slicing retrieves a sub-list. Indices start at 0, and negative indices count back from the end.</p>

In [6]:
# Access specific elements
print(f"First reading: {latency_readings[0]}")
print(f"Last reading: {latency_readings[-1]}")

# Slicing: [start:stop] (start is included, stop is excluded)
# Get the first 3 readings (indices 0, 1, and 2)
first_three = latency_readings[0:3]
print(f"First 3 readings: {first_three}")

# Get everything after the 2nd element (index 2 onward)
after_second = latency_readings[2:]
print(f"After the 2nd element: {after_second}")



First reading: 300
Last reading: 105
First 3 readings: [300, 120, 115]
After the 2nd element: [115, 90, 150, 200, 95, 110, 105]
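
<p>Slices also accept negative indices and an optional step (<code>[start:stop:step]</code>). A short sketch, using a copy of the readings so the cell runs on its own:</p>

```python
latency_readings = [300, 120, 115, 90, 150, 200, 95, 110, 105]

# Negative start: the last 3 readings
last_three = latency_readings[-3:]
print(f"Last 3 readings: {last_three}")

# Step of 2: every other reading
every_other = latency_readings[::2]
print(f"Every other reading: {every_other}")

# Step of -1: a reversed copy
reversed_readings = latency_readings[::-1]
print(f"Reversed: {reversed_readings}")

# A slice is a new list, so changing it leaves the original untouched
first_three = latency_readings[:3]
first_three[0] = 0
print(f"Original first reading is still: {latency_readings[0]}")
```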


<h3>List Operations</h3>
<p>Python's built-in functions let you summarize a list quickly without writing loops.</p>

In [10]:
# Aggregation functions
total_latency = sum(latency_readings)
min_latency = min(latency_readings)
max_latency = max(latency_readings)
reading_count = len(latency_readings)

print(f"Total Latency: {total_latency}")
print(f"Min Latency: {min_latency}")
print(f"Max Latency: {max_latency}")
print(f"Reading Count: {reading_count}")


# .pop() removes an item (default is the last one)
removed_item = latency_readings.pop()
print(f"Removed {removed_item}. List is now: {latency_readings}")

# .count() checks frequency
# Let's add a duplicate to test
latency_readings.append(300)
print(f"Occurrences of 300: {latency_readings.count(300)}")
print(f"latency readings: {latency_readings}")

Total Latency: 1285
Min Latency: 90
Max Latency: 300
Reading Count: 9
Removed 105. List is now: [300, 120, 115, 90, 150, 200, 95, 110]
Occurrences of 300: 2
latency readings: [300, 120, 115, 90, 150, 200, 95, 110, 300]
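
<p>The same built-ins combine into other quick summaries. The sketch below computes an average and the three slowest readings, starting from a fresh copy of the readings as they stood after the <code>.extend()</code> step:</p>

```python
latency_readings = [300, 120, 115, 90, 150, 200, 95, 110, 105]

# Average = total divided by count
average_latency = sum(latency_readings) / len(latency_readings)
print(f"Average latency: {average_latency:.1f} ms")

# sorted() returns a NEW sorted list and leaves the original order alone
sorted_readings = sorted(latency_readings)
print(f"Sorted readings: {sorted_readings}")

# Sort descending, then slice to get the three slowest readings
slowest_three = sorted(latency_readings, reverse=True)[:3]
print(f"Three slowest readings: {slowest_three}")
```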


<h2>3. Sets</h2>
<p>Sets are unordered collections with <b>no duplicate elements</b>. They are heavily used to find unique values (e.g., "How many <i>unique</i> visitors did we have today?").</p>

In [13]:
# Creating a set automatically removes duplicates
raw_tags = {"python", "java", "python", "c++", "java"}
print(f"Unique tags: {raw_tags}")

# Converting a list to a set to remove duplicates
user_ids = [101, 102, 101, 103, 104, 102]
unique_users = set(user_ids)
print(f"Unique IDs: {unique_users}")

# Adding elements
unique_users.add(105)

Unique tags: {'java', 'c++', 'python'}
Unique IDs: {104, 101, 102, 103}
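
<p>Because sets are unordered, converting a list to a set can shuffle the items. If the original order matters, one option is <code>dict.fromkeys()</code>, which keeps the first occurrence of each item. The sketch below also shows counting unique values and testing membership:</p>

```python
user_ids = [101, 102, 101, 103, 104, 102]

# Remove duplicates while keeping first-seen order
unique_in_order = list(dict.fromkeys(user_ids))
print(f"Unique IDs in original order: {unique_in_order}")

# Count unique values
unique_count = len(set(user_ids))
print(f"Number of unique users: {unique_count}")

# Membership tests on a set
known_users = set(user_ids)
print(f"Is 103 a known user? {103 in known_users}")
print(f"Is 999 a known user? {999 in known_users}")
```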


<h3>Set Operations</h3>
<p>This is where sets shine: finding intersections (common items) and differences.</p>

In [14]:
visitors_day_1 = {"UserA", "UserB", "UserC"}
visitors_day_2 = {"UserB", "UserC", "UserD", "UserE"}

# Intersection: Who visited on BOTH days?
loyal_visitors = visitors_day_1.intersection(visitors_day_2)
print(f"Loyal Visitors: {loyal_visitors}")

Loyal Visitors: {'UserC', 'UserB'}
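
<p>The remaining set operations follow the same pattern. A minimal sketch using the same two visitor sets (redefined so the cell runs on its own); the result names are illustrative, and <code>sorted()</code> is used only to make the printed order predictable:</p>

```python
visitors_day_1 = {"UserA", "UserB", "UserC"}
visitors_day_2 = {"UserB", "UserC", "UserD", "UserE"}

# Union: everyone who visited on EITHER day
all_visitors = visitors_day_1.union(visitors_day_2)
print(f"All visitors: {sorted(all_visitors)}")

# Difference: visited on day 1 but NOT on day 2
did_not_return = visitors_day_1.difference(visitors_day_2)
print(f"Did not return: {sorted(did_not_return)}")

# Difference the other way: new on day 2
new_visitors = visitors_day_2 - visitors_day_1  # operator form of .difference()
print(f"New visitors: {sorted(new_visitors)}")

# Symmetric difference: visited on exactly one of the two days
one_day_only = visitors_day_1.symmetric_difference(visitors_day_2)
print(f"One day only: {sorted(one_day_only)}")
```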
