<a href="https://colab.research.google.com/github/Dasaru-t/My-Machine-Learning-Course/blob/main/Section%201-%20Python%20Crash%20Course/1_Python_Data_Structures_The_Building_Blocks_of_Data_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Python Data Structures: The Building Blocks of Data Science</h1>
<p>In this notebook, we start with Booleans and logical operators, then dive deep into the fundamental data structures in Python: <b>Lists</b>, <b>Sets</b>, <b>Dictionaries</b>, and <b>Tuples</b>.</p>
<p>As a future Data Scientist, you won't just use these to store "apples" and "bananas." You will use them to manage datasets, configure machine learning models, and handle API responses. We will explore these with modern, relevant examples.</p>

<h2>1. Boolean Variables &amp; Logical Operators</h2>
<p>Booleans represent one of two values: <code>True</code> or <code>False</code>. In data analysis, these are crucial for filtering data (e.g., "Select rows where <code>is_active</code> is True").</p>

In [None]:
# Boolean values are often the result of comparisons
is_logged_in = True
has_premium_access = False

print(f"Is the user logged in? {is_logged_in}")
print(f"Type of variable: {type(is_logged_in)}")

# The built-in bool() function reports the "truthiness" of a value
# 0, None, empty strings, and empty collections are falsy. Almost everything else is truthy.
print(bool(0))          # False
print(bool(1))          # True
print(bool([]))         # False (Empty list)
print(bool("Data"))     # True

Is the user logged in? True
Type of variable: <class 'bool'>
False
True
False
True
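
<p>A few truthiness cases tend to surprise people when cleaning raw data. The sketch below is illustrative only (the list <code>raw_values</code> is made up for this example). It shows that the strings <code>"0"</code> and <code>"False"</code> are truthy because they are non-empty, even though they look like falsy values.</p>

```python
# Values that commonly appear in raw data (illustrative only)
raw_values = [0, 0.0, "", [], {}, None, "0", "False", [0]]

truthiness = {}
for value in raw_values:
    # repr() keeps 0 and "0" distinguishable when used as a label
    truthiness[repr(value)] = bool(value)
    print(f"{value!r:>8} -> {bool(value)}")
```

<p>Only the first six values are falsy. A non-empty string or a list containing a single <code>0</code> still counts as truthy.</p>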


<h3>String Methods returning Booleans</h3>
<p>When cleaning data, you often need to check the format of a string (e.g., is this ID a number? Is this name all uppercase?).</p>

In [None]:
raw_user_id = "User123"
dirty_price = "1500"
title_text = "Machine Learning Guide"

print(f"Is '{raw_user_id}' alphanumeric? {raw_user_id.isalnum()}") # Useful for ID validation
print(f"Is '{dirty_price}' a digit? {dirty_price.isdigit()}")       # Useful for cleaning numerical columns
print(f"Is '{title_text}' a title? {title_text.istitle()}")         # Useful for NLP tasks

# Logical Operators (AND, OR, NOT)
# Scenario: A user can access a dashboard if they are logged in OR have an admin key.
admin_key = False
access_granted = is_logged_in or admin_key

print(f"Access Granted: {access_granted}")

Is 'User123' alphanumeric? True
Is '1500' a digit? True
Is 'Machine Learning Guide' a title? True
Access Granted: True
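
<p>The cell above only exercises <code>or</code>. As a minimal sketch, <code>and</code> and <code>not</code> can be tried the same way. The variable names below are hypothetical and independent of the notebook's earlier cells.</p>

```python
# Hypothetical flags for illustration only
user_logged_in = True
user_has_premium = False

# and: True only when both operands are truthy
can_view_premium_report = user_logged_in and user_has_premium

# not: inverts a Boolean, so this reads "logged in but not premium"
should_show_upgrade_banner = user_logged_in and not user_has_premium

print(f"Can view premium report: {can_view_premium_report}")
print(f"Show upgrade banner: {should_show_upgrade_banner}")
```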


<h2>2. Lists</h2>
<p>Lists are mutable (changeable), ordered sequences. In Data Science, think of a list as a column in a dataset or a collection of file paths to process.</p>

In [None]:
# Creating a list of server latency readings (in ms)
latency_readings = [120, 115, 90, 150, 200]
print(f"Original Readings: {latency_readings}")

# Lists can hold mixed data types, though in data science we usually keep them uniform
mixed_data = ["Server_A", 200, True]


Original Readings: [120, 115, 90, 150, 200]


<h3>Appending and Inserting</h3>
<p>You will often start with an empty list and fill it as you loop through data.</p>

In [None]:
# .append() adds an item to the end
latency_readings.append(95)
print(f"After appending 95: {latency_readings}")

# .insert() adds an item at a specific index
# Inserting a high priority reading at the start
latency_readings.insert(0, 300)
print(f"After inserting 300 at index 0: {latency_readings}")

# .extend() combines two lists
new_batch = [110,105]
latency_readings.extend(new_batch)
print(f"After extending with {new_batch}: {latency_readings}")

After appending 95: [120, 115, 90, 150, 200, 95]
After inserting 300 at index 0: [300, 120, 115, 90, 150, 200, 95]
After extending with [110, 105]: [300, 120, 115, 90, 150, 200, 95, 110, 105]
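
<p>A common mistake is passing a list to <code>.append()</code> when <code>.extend()</code> was intended. The sketch below uses hypothetical lists (<code>batch_a</code>, <code>batch_b</code>) to show the difference.</p>

```python
# Two identical starting lists (hypothetical values)
batch_a = [120, 115]
batch_b = [120, 115]
new_values = [90, 150]

batch_a.append(new_values)  # adds the whole list as ONE nested element
batch_b.extend(new_values)  # adds each element individually

print(batch_a)  # [120, 115, [90, 150]]
print(batch_b)  # [120, 115, 90, 150]
```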


<h3>Indexing and Slicing (Crucial for Data Manipulation)</h3>
<p>You need to know how to grab specific parts of your data.</p>

In [None]:
# Access specific elements
print(f"First reading: {latency_readings[0]}")
print(f"Last reading: {latency_readings[-1]}")

# Slicing: [start:stop] (start is inclusive, stop is exclusive)
# Get the first 3 readings
first_three = latency_readings[0:3]
print(f"First 3 readings: {first_three}")

# Get everything after the 2nd element
after_second = latency_readings[2:]
print(f"After the 2nd element: {after_second}")



First reading: 300
Last reading: 105
First 3 readings: [300, 120, 115]
After the 2nd element: [115, 90, 150, 200, 95, 110, 105]
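
<p>Slices also accept negative indices and an optional step. The following is a small sketch on a hypothetical list (<code>sample_readings</code>), separate from <code>latency_readings</code> above.</p>

```python
# Hypothetical sample list, independent of the notebook's latency_readings
sample_readings = [300, 120, 115, 90, 150, 200]

# The stop index is exclusive: indices 1 and 2 are returned, index 3 is not
middle_two = sample_readings[1:3]

# Negative indices count from the end: the last two readings
last_two = sample_readings[-2:]

# A third value sets the step: every second reading
every_second = sample_readings[::2]

# A step of -1 returns a reversed copy; the original list is unchanged
reversed_copy = sample_readings[::-1]

print(middle_two)     # [120, 115]
print(last_two)       # [150, 200]
print(every_second)   # [300, 115, 150]
print(reversed_copy)  # [200, 150, 90, 115, 120, 300]
```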


<h3>List Operations</h3>
<p>Built-in functions let you summarize a list quickly without writing loops.</p>

In [None]:
# Aggregation functions
total_latency = sum(latency_readings)
min_latency = min(latency_readings)
max_latency = max(latency_readings)
reading_count = len(latency_readings)

print(f"Total Latency: {total_latency}")
print(f"Min Latency: {min_latency}")
print(f"Max Latency: {max_latency}")
print(f"Reading Count: {reading_count}")


# .pop() removes an item (default is the last one)
removed_item = latency_readings.pop()
print(f"Removed {removed_item}. List is now: {latency_readings}")

# .count() checks frequency
# Let's add a duplicate to test
latency_readings.append(300)
print(f"Occurrences of 300: {latency_readings.count(300)}")
print(f"latency readings: {latency_readings}")

Total Latency: 1285
Min Latency: 90
Max Latency: 300
Reading Count: 9
Removed 105. List is now: [300, 120, 115, 90, 150, 200, 95, 110]
Occurrences of 300: 2
latency readings: [300, 120, 115, 90, 150, 200, 95, 110, 300]
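
<p>The same built-ins combine into simple summary statistics. This is a rough sketch on a hypothetical list (<code>response_times_ms</code>). The median logic assumes an odd number of readings.</p>

```python
# Hypothetical readings in milliseconds
response_times_ms = [120, 115, 90, 150, 200]

# Mean: total divided by the number of readings
mean_ms = sum(response_times_ms) / len(response_times_ms)

# sorted() returns a new sorted list and leaves the original untouched
ascending = sorted(response_times_ms)

# For an odd-length list, the median is the middle element of the sorted copy
median_ms = ascending[len(ascending) // 2]

print(f"Mean: {mean_ms}")      # Mean: 135.0
print(f"Median: {median_ms}")  # Median: 120
```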


<h2>3. Sets</h2>
<p>Sets are unordered collections with <b>no duplicate elements</b>. They are heavily used to find unique values (e.g., "How many <i>unique</i> visitors did we have today?").</p>

In [None]:
# Creating a set automatically removes duplicates
raw_tags = {"python", "java", "python", "c++", "java"}
print(f"Unique tags: {raw_tags}")

# Converting a list to a set to remove duplicates
user_ids = [101, 102, 101, 103, 104, 102]
unique_users = set(user_ids)
print(f"Unique IDs: {unique_users}")

# Adding elements
unique_users.add(105)

Unique tags: {'java', 'c++', 'python'}
Unique IDs: {104, 101, 102, 103}
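
<p>Two details are worth checking for yourself: empty braces create a dictionary rather than a set, and the printed order of a set is not guaranteed. The sketch below uses made-up visitor IDs (<code>page_views</code>).</p>

```python
# {} creates an empty dictionary, not an empty set
empty_braces = {}
empty_set = set()
print(type(empty_braces))  # <class 'dict'>
print(type(empty_set))     # <class 'set'>

# len() of a set answers "how many unique values?"
page_views = ["u1", "u2", "u1", "u3", "u2"]  # hypothetical visitor IDs
unique_visitor_count = len(set(page_views))
print(unique_visitor_count)  # 3

# Sets are unordered, so use sorted() when a stable display order matters
print(sorted(set(page_views)))  # ['u1', 'u2', 'u3']
```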


<h3>Set Operations</h3>
<p>This is where sets shine: finding intersections (common items) and differences.</p>

In [None]:
visitors_day_1 = {"UserA", "UserB", "UserC"}
visitors_day_2 = {"UserB", "UserC", "UserD", "UserE"}

# Intersection: Who visited on BOTH days?
loyal_visitors = visitors_day_1.intersection(visitors_day_2)
print(f"Loyal Visitors: {loyal_visitors}")

# Difference: Who visited on Day 2 but NOT Day 1? (New visitors)
new_visitors = visitors_day_2.difference(visitors_day_1)
print(f"New Visitors: {new_visitors}")

# Difference Update: Remove, in place, every item that also exists in another set
visitors_day_2.difference_update(visitors_day_1)
print(f"Day 2 Visitors (excluding Day 1 returnees): {visitors_day_2}")

Loyal Visitors: {'UserC', 'UserB'}
New Visitors: {'UserE', 'UserD'}
Day 2 Visitors (excluding Day 1 returnees): {'UserE', 'UserD'}
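
<p>Set methods also have operator equivalents, and two further operations (union and symmetric difference) round out the picture. The sketch below redefines the visitor sets under new, hypothetical names (<code>day_1</code>, <code>day_2</code>) so it does not depend on the mutated sets above.</p>

```python
day_1 = {"UserA", "UserB", "UserC"}
day_2 = {"UserB", "UserC", "UserD", "UserE"}

all_visitors = day_1 | day_2   # union: visited on either day
both_days = day_1 & day_2      # intersection: same as .intersection()
churned = day_1 - day_2        # difference: Day 1 only
one_day_only = day_1 ^ day_2   # symmetric difference: exactly one of the two days

# sorted() gives a stable order for display
print(sorted(all_visitors))  # ['UserA', 'UserB', 'UserC', 'UserD', 'UserE']
print(sorted(both_days))     # ['UserB', 'UserC']
print(sorted(churned))       # ['UserA']
print(sorted(one_day_only))  # ['UserA', 'UserD', 'UserE']
```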


<h2>4. Dictionaries</h2>
<p>Dictionaries are mutable collections of key-value pairs in which every key is unique. Since Python 3.7 they preserve insertion order. They are the natural Python representation of JSON objects, API responses, and configuration files.</p>

In [None]:
# A typical modern use case: storing model configurations or user profiles
model_config = {
    "model_name": "RandomForest",
    "n_estimators": 100,
    "max_depth": 5,
    "features": ["age", "income", "education"]
}

print(type(model_config))
print(model_config)

# Accessing values
print(f"Model Name: {model_config['model_name']}")

# Adding/Updating values
model_config['version'] = "1.0.2"
model_config['n_estimators'] = 200
print(model_config)

# Delete values
del model_config['max_depth']
print(model_config)

<class 'dict'>
{'model_name': 'RandomForest', 'n_estimators': 100, 'max_depth': 5, 'features': ['age', 'income', 'education']}
Model Name: RandomForest
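{'model_name': 'RandomForest', 'n_estimators': 200, 'max_depth': 5, 'features': ['age', 'income', 'education'], 'version': '1.0.2'}
{'model_name': 'RandomForest', 'n_estimators': 200, 'features': ['age', 'income', 'education'], 'version': '1.0.2'}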
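
<p>Indexing a key that does not exist raises a <code>KeyError</code>, which matters when parsing API responses with optional fields. The sketch below shows a few safer access patterns. The dictionary <code>training_config</code> and the default value <code>0.1</code> are assumptions made for illustration.</p>

```python
# Hypothetical configuration, independent of model_config above
training_config = {"model_name": "RandomForest", "n_estimators": 100}

# Indexing a missing key raises KeyError
try:
    training_config["learning_rate"]
except KeyError as error:
    print(f"Missing key: {error}")

# .get() returns a supplied default (or None) instead of raising
learning_rate = training_config.get("learning_rate", 0.1)

# 'in' tests for the presence of a key
has_depth = "max_depth" in training_config

# .pop() removes a key and returns its value; the default avoids KeyError
removed_estimators = training_config.pop("n_estimators", None)

print(learning_rate)       # 0.1
print(has_depth)           # False
print(removed_estimators)  # 100
print(training_config)     # {'model_name': 'RandomForest'}
```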


<h3>Looping through Dictionaries</h3>
<p>You often need to iterate through keys or values to process data.</p>

In [None]:
# Iterate through keys
print("--- Keys ---")
for key in model_config:
  print(key)

# Iterate through values
print("\n--- Values ---")
for value in model_config.values():
  print(value)

# Iterate through both (Items) - MOST USEFUL
print("\n--- Key-Value Pairs ---")
for key, value in model_config.items():
  print(f"{key}: {value}")


--- Keys ---
model_name
n_estimators
features
version

--- Values ---
RandomForest
200
['age', 'income', 'education']
1.0.2

--- Key-Value Pairs ---
model_name: RandomForest
n_estimators: 200
features: ['age', 'income', 'education']
version: 1.0.2


<h3>Nested Dictionaries</h3>
<p>Real-world data is rarely flat. You will often encounter dictionaries inside dictionaries (like nested JSON).</p>

In [None]:
# Representing a dataset of employees
employees = {
    "emp_001": {
        "name": "Krish",
        "role": "Data Scientist",
        "skills": ["Python", "AI", "Cloud"]
    },
    "emp_002": {
        "name": "Sarah",
        "role": "Data Engineer",
        "skills": ["SQL", "Spark", "ETL"]
    }
}

# Accessing nested data
sarahs_role = employees['emp_002']['role']
krish_first_skill = employees['emp_001']['skills'][0]

print(f"Emp 002 Role: {sarahs_role}")
print(f"Emp 001 Primary Skill: {krish_first_skill}")

Emp 002 Role: Data Engineer
Emp 001 Primary Skill: Python
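
<p>Nested dictionaries can be looped over with <code>.items()</code> in the same way as flat ones. As a sketch, the block below rebuilds the employee data under a hypothetical name (<code>team</code>) and shows one way to look up a record that may not exist. The ID <code>emp_999</code> and the fallback <code>"unknown"</code> are made up for the example.</p>

```python
team = {
    "emp_001": {"name": "Krish", "role": "Data Scientist", "skills": ["Python", "AI", "Cloud"]},
    "emp_002": {"name": "Sarah", "role": "Data Engineer", "skills": ["SQL", "Spark", "ETL"]},
}

# Loop over every record: the key is the ID, the value is the inner dictionary
summaries = []
for emp_id, record in team.items():
    summary = f"{emp_id}: {record['name']} ({record['role']}), {len(record['skills'])} skills"
    summaries.append(summary)
    print(summary)

# Chained .get() with an empty-dict default avoids KeyError for a missing employee
missing_role = team.get("emp_999", {}).get("role", "unknown")
print(missing_role)  # unknown
```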


<h2>5. Tuples</h2>
<p>Tuples are similar to lists but <b>immutable</b> (cannot be changed). Use them for data that should remain constant, like geographic coordinates or database connection settings.</p>

In [None]:
# Defining a tuple
db_connection = ("192.168.1.10", 5432) # IP and Port
print(f"DB Connection: {db_connection}")
print(f"Type: {type(db_connection)}")

# Tuples do not support item assignment
# db_connection[0] = "127.0.0.1"  # Uncommenting this line raises a TypeError

# Built-in methods
# .count() and .index() work just like in lists
example_tuple = ("A", "B", "C", "A")
print(f"Count of 'A': {example_tuple.count('A')}")
print(f"Index of 'B': {example_tuple.index('B')}")

DB Connection: ('192.168.1.10', 5432)
Type: <class 'tuple'>
Count of 'A': 2
Index of 'B': 1
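
<p>Two things are worth trying directly: unpacking a tuple into named variables, and confirming that item assignment fails. The sketch below uses a hypothetical tuple (<code>server_address</code>) and catches the error so the cell keeps running.</p>

```python
server_address = ("192.168.1.10", 5432)  # hypothetical host and port

# Unpacking assigns each element to its own variable
host, port = server_address
print(f"Host: {host}, Port: {port}")

# Item assignment on a tuple raises TypeError
assignment_failed = False
try:
    server_address[0] = "127.0.0.1"
except TypeError as error:
    assignment_failed = True
    print(f"TypeError: {error}")
```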


<h3>Quick Comparison Table</h3>
<table style="border-collapse: collapse; width: 100%; border: 1px solid #dddddd;">
  <tr style="background-color: #f2f2f2; text-align: left;">
    <th style="border: 1px solid #dddddd; padding: 8px;">Data Structure</th>
    <th style="border: 1px solid #dddddd; padding: 8px;">Mutable?</th>
    <th style="border: 1px solid #dddddd; padding: 8px;">Ordered?</th>
    <th style="border: 1px solid #dddddd; padding: 8px;">Allow Duplicates?</th>
    <th style="border: 1px solid #dddddd; padding: 8px;">Syntax</th>
    <th style="border: 1px solid #dddddd; padding: 8px;">Primary Use Case</th>
  </tr>
  <tr>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>List</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Yes</td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Yes</td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Yes</td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><code>[ ]</code></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Collections of data (columns, rows)</td>
  </tr>
  <tr>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>Tuple</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>No</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Yes</td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Yes</td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><code>( )</code></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Constant data (config, coords)</td>
  </tr>
  <tr>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>Set</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Yes</td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>No</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>No</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><code>{ }</code></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Unique items, mathematical ops</td>
  </tr>
  <tr>
    <td style="border: 1px solid #dddddd; padding: 8px;"><b>Dictionary</b></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Yes</td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Yes*</td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Keys: No, Values: Yes</td>
    <td style="border: 1px solid #dddddd; padding: 8px;"><code>{k:v}</code></td>
    <td style="border: 1px solid #dddddd; padding: 8px;">Key-Value pairs, JSON, lookups</td>
  </tr>
</table>
<p><i>*Note: Dictionaries preserve insertion order in modern Python (3.7+), but their items are accessed by key, not by positional index.</i></p>

<hr>

<h3>1. Lists <code>[ ]</code></h3>
<ul>
  <li><b>What it is:</b> The most versatile sequence. It works like a dynamic array that can grow and shrink.</li>
  <li><b>Key Characteristic:</b> <b>Mutable &amp; Ordered</b>. You can change elements (<code>lst[0] = 5</code>), add new ones (<code>append</code>), and rely on their position (<code>index</code>).</li>
  <li><b>Data Science Reality:</b>
    <ul>
      <li>Lists are the backbone of data manipulation.</li>
      <li>You will use them to store columns of data from a CSV before converting them into efficient NumPy arrays or Pandas DataFrames.</li>
      <li><i>Example:</i> <code>[0.82, 0.91, 0.88]</code> (A list of model accuracy scores).</li>
    </ul>
  </li>
</ul>

<h3>2. Tuples <code>( )</code></h3>
<ul>
  <li><b>What it is:</b> A "read-only" list. Once created, it cannot be modified.</li>
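  <li><b>Key Characteristic:</b> <b>Immutable</b>. Because a tuple cannot change after creation, it protects data from accidental modification, uses slightly less memory than an equivalent list, and (when all of its elements are hashable) can serve as a dictionary key or set member.</li>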
  <li><b>Data Science Reality:</b>
    <ul>
      <li>Used for data that <b>should not change</b> during program execution.</li>
      <li>Functions often return multiple values as tuples.</li>
      <li><i>Example:</i> <code>(1920, 1080)</code> (Image resolution dimensions).</li>
    </ul>
  </li>
</ul>
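
<p>The two tuple points above (multiple return values and hashability) can be sketched as follows. The function <code>min_max</code> and the dictionary <code>resolution_labels</code> are hypothetical names introduced only for this example.</p>

```python
def min_max(values):
    """Return the smallest and largest value as a (min, max) tuple."""
    return min(values), max(values)

# The returned tuple can be unpacked straight into two variables
lowest, highest = min_max([0.82, 0.91, 0.88])
print(lowest, highest)  # 0.82 0.91

# Tuples are hashable when their elements are, so they can be dictionary keys
resolution_labels = {(1920, 1080): "Full HD", (3840, 2160): "4K UHD"}
label = resolution_labels[(1920, 1080)]
print(label)  # Full HD
```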

<h3>3. Sets <code>{ }</code></h3>
<ul>
  <li><b>What it is:</b> A collection of unique items. It works like a mathematical set.</li>
  <li><b>Key Characteristic:</b> <b>No Duplicates &amp; Unordered</b>. It automatically removes duplicates, but you cannot access items by index (e.g., <code>my_set[0]</code> raises a <code>TypeError</code>).</li>
  <li><b>Data Science Reality:</b>
    <ul>
      <li>Essential for cleaning data (removing duplicates).</li>
      <li>Used for checking membership (Is "user_123" in the prohibited list?) because a set lookup takes roughly constant time on average, while a list must be scanned element by element.</li>
      <li><i>Example:</i> <code>{'US', 'UK', 'CA'}</code> (A set of unique country codes from a dataset).</li>
    </ul>
  </li>
</ul>
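
<p>The membership speed difference is easy to measure. The sketch below is a rough benchmark, not a rigorous one. The collection size, the repeat count, and the names <code>blocked_list</code> and <code>blocked_set</code> are arbitrary choices for illustration, and the actual timings will vary by machine.</p>

```python
import timeit

# Hypothetical block list with 100,000 user IDs
blocked_list = [f"user_{i}" for i in range(100_000)]
blocked_set = set(blocked_list)

# Worst case for the list: the target is the last element
target = "user_99999"

# Time 100 membership checks against each structure
list_seconds = timeit.timeit(lambda: target in blocked_list, number=100)
set_seconds = timeit.timeit(lambda: target in blocked_set, number=100)

print(f"List lookup: {list_seconds:.6f} s")
print(f"Set lookup:  {set_seconds:.6f} s")
```

<p>The list has to compare the target against every element in turn, while the set hashes the target and jumps to the matching slot, so the gap widens as the collection grows.</p>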

<h3>4. Dictionaries <code>{Key: Value}</code></h3>
<ul>
  <li><b>What it is:</b> A map connecting unique <b>Keys</b> to <b>Values</b>.</li>
  <li><b>Key Characteristic:</b> <b>Fast Lookup</b>. You find data by its "name" (key), not its position number.</li>
  <li><b>Data Science Reality:</b>
    <ul>
      <li>Dictionaries mirror the structure of JSON, the format most web APIs return.</li>
      <li>Used to map categories to numbers (Encoding).</li>
      <li><i>Example:</i> <code>{'Male': 0, 'Female': 1}</code> (Encoding categorical variables for Machine Learning).</li>
    </ul>
  </li>
</ul>
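
<p>The encoding example above can be sketched with a plain loop. The column <code>raw_column</code> is made up, and using <code>-1</code> for categories missing from the mapping is an assumption for this example, not a standard convention.</p>

```python
gender_codes = {"Male": 0, "Female": 1}
raw_column = ["Female", "Male", "Male", "Unknown"]  # hypothetical raw values

encoded_column = []
for label in raw_column:
    # -1 is an arbitrary sentinel chosen here for unseen categories
    encoded_column.append(gender_codes.get(label, -1))

print(encoded_column)  # [1, 0, 0, -1]
```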