# Creating variables

# Week 0: Python Fundamentals

## Learning Objectives

By the end of this session, you will be able to:

- ✓ Create and manipulate Python variables (strings, numbers)
- ✓ Control program flow using conditionals (if/elif/else) and loops (for, while)
- ✓ Work with data structures: lists, tuples, sets, and dictionaries
- ✓ Define and use functions effectively
- ✓ Read and write files using Python
- ✓ Load and analyze data with Pandas DataFrames
- ✓ Perform numerical operations with NumPy arrays
- ✓ Create visualizations with Matplotlib (pie charts, histograms, boxplots, scatterplots)
- ✓ Apply basic machine learning with scikit-learn

**Estimated Time:** 3-4 hours  
**Prerequisites:** None - this is a beginner-friendly introduction  
**Relevance:** These skills form the foundation for web scraping, clickstream analysis, network analysis, and recommendation systems in later weeks.

---

## Setup & Installation

Before we begin, let's ensure all required libraries are installed. Run the cell below to install the packages we'll use throughout this course.

In [None]:
# Install required packages
# Note: If running in Jupyter, you may need to restart the kernel after installation
!pip install numpy pandas scikit-learn matplotlib --quiet

# Verify installation
import numpy as np
import pandas as pd
import sklearn
import matplotlib.pyplot as plt

print("✓ NumPy version:", np.__version__)
print("✓ Pandas version:", pd.__version__)
print("✓ Scikit-learn version:", sklearn.__version__)
print("✓ Matplotlib version:", plt.matplotlib.__version__)
print("\n✅ All packages installed successfully!")

---

# Python Fundamentals

Now let's dive into Python basics! We'll start with variables and gradually build up to more complex concepts.

In this notebook we will look into the concept of variables.

Python, like R, is a dynamically typed language, meaning a variable can be rebound to a value of a different type at any point. This is convenient, but it also means you cannot rely on a variable keeping its type: you sometimes have to retrace your steps through the code to see what it currently holds. That is especially hard in a notebook like this one, where cells can be executed in any order.
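
To make the idea of dynamic typing concrete, here is a minimal sketch. The variable names `value` and `is_text` are made up for illustration; `type()` and `isinstance()` are Python built-ins.

```python
# A variable is just a name; the type belongs to the value it currently refers to
value = 10
print(type(value))    # <class 'int'>

value = "ten"         # the same name now refers to a string
print(type(value))    # <class 'str'>

# isinstance() lets you check a type before relying on it
is_text = isinstance(value, str)
print(is_text)        # True
```

Checking with `isinstance()` is one simple way to guard code that only makes sense for a particular type.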

## Intro and strings

Let's create a variable:

In [None]:
name = "Edinburgh"
name

This creates a string variable. A notebook displays the value of the last expression in a cell, but it is safer to use the `print` function:

In [None]:
print(name)

It is also wise to check the type of the variable, in case you are lost:

In [None]:
type(name)

This confirms that we are dealing with a string. There are a few things we can do with strings (which we can write using either single or double quotes):

In [None]:
name = 'university of edinburgh'
print(name.lower())
print(name.upper())
print(name.title())

We can concatenate strings using `+`, or pass several strings to `print` separated by commas, in which case they are printed with a space between them:

In [None]:
print('University', 'of Edinburgh')
print('University' + ' ' + 'of Edinburgh')

Writing `print('The University of Edinburgh is ' + 439)` will not work, because `+` cannot combine a string with a number. We can, however, convert any object into a string:

In [None]:
print('The University of Edinburgh is '+ str(439))
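
As an alternative to `str()` plus `+`, f-strings (available from Python 3.6) insert values directly into the text. A small sketch; the names `count`, `ratio`, `message`, and `formatted` are chosen only for this example:

```python
count = 439  # the same example number as in the cell above
message = f"The University of Edinburgh is {count}"
print(message)

# A format specifier after the colon controls how a number is rendered
ratio = 0.123456
formatted = f"{ratio:.2f}"
print(formatted)  # 0.12
```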

A few other useful tricks:

In [None]:
name = " edinburgh "
print("|"+name.lstrip()+"|")
print("|"+name.rstrip()+"|")
print("|"+name.strip()+"|")

You can use escape sequences such as `\t` (tab) and `\n` (newline) as well:

In [None]:
print('Edinburgh\thas a university\nrunning web & social network analytics course')

## Numbers

In [None]:
a = 10
b = -10.1023

#Some operations illustrated (\t stands for a tab)
print("a: \t\t\t" + str(a))
print("b: \t\t\t" + str(b))
print("absolute of b: \t\t" + str(abs(b)))
print("rounded b: \t\t" + str(round(b,3)))
print("square of a: \t\t" + str(pow(a,2)))
print("cube of a: \t\t" + str(a**3))
print("integer part of b: \t" + str(int(b)))

# Flow Control

Control flow statements let you decide which parts of your code run, and how often: conditionals choose between branches, and loops repeat a block.

## If statements

In [None]:
price = -5

if price <0:
    print("Price is negative!")
elif price <1:
    print("Price is too small!")
else:
    print("Price is suitable.")

Especially in text mining, comparing strings is very important:

In [None]:
#Comparing strings
name1 = "edinburgh"
name2 = "Edinburgh"

if name1 == name2:
    print("Equal")
else:
    print("Not equal")

if name1.lower() == name2.lower():
    print("Equal")
else:
    print("Not equal")

Using multiple conditions:

In [None]:
number = 9
if number > 1 and not number > 9:
    print("Number is between 1 and 10")
    
number = 9
name = 'johannes'
if number < 5 or 'j' in name:
    print("Number is lower than 5 or the name contains a 'j'")
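
A range check like the first one above can also be written as a chained comparison, which many people find easier to read. A minimal sketch, using the made-up names `candidate`, `in_range`, and `out_of_range`:

```python
candidate = 9

# Equivalent to: candidate > 1 and not candidate > 9
in_range = 1 < candidate <= 9
print(in_range)  # True

# A value outside the range fails the same test
out_of_range = 1 < 10 <= 9
print(out_of_range)  # False
```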

## While loops

In [None]:
number = 4
while number > 1:
    print(number)
    number = number -1

## For loops

For loops allow you to iterate over the elements of a collection, for example a list:

In [None]:
# We'll look into lists in a minute
number_list = [1, 2, 3, 4]
for item in number_list:
    print(item)

In [None]:
# Avoid naming a variable 'list': that would hide Python's built-in list type
letters = ['a', 'b', 'c']
for item in letters:
    print(item)

Ranges are also useful. Note that the upper element is not included and we can adjust the step size:

In [None]:
for i in range(1,4):
    print(i)

In [None]:
for i in range(30,100, 10):
    print(i)

## Indentation

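Be very careful with indentation, because it determines which block a statement belongs to. In the cell below, the last `print` in each example sits outside the inner `if`, so its message appears even when `number_2` is not higher than 5.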

In [None]:
number_1 = 3
number_2 = 5

print('No indent (no tabs used)')
if number_1 > 1:
    print('\tNumber 1 higher than 1.')
    if number_2 > 5:
        print('\t\tnumber 2 higher than 5')
    print('\tnumber 2 higher than 5')

number_1 = 3
number_2 = 6

print('No indent (no tabs used)')
if number_1 > 1:
    print('\tNumber 1 higher than 1.')
    if number_2 > 5:
        print('\t\tnumber 2 higher than 5')
    print('\tnumber 2 higher than 5')

# List & Tuple

## Lists

Lists are great for collecting anything. They can contain objects of different types. For example:

In [None]:
names = [5, "Giovanni", "Rose", "Yongzhe", "Luciana", "Imani"]

Mixing types like this is usually not good practice, though. Let's start with a list of names:

In [None]:
names = ["Johannes", "Giovanni", "Rose", "Yongzhe", "Luciana", "Imani"]

In [None]:
# Loop names
for name in names:
    print('Name: '+name)

# Get 'Giovanni' from list
# Lists start counting at 0
giovanni = names[1]
print(giovanni.upper())

# Get last item
name = names[-1]
print(name.upper())

# Get second to last item
name = names[-2]
print(name.upper())

print("First three: "+str(names[0:3]))
print("First four: "+str(names[:4]))
print("Up until the second to last one: "+str(names[:-2]))
print("Last two: "+str(names[-2:]))

## Enumeration

`enumerate` adds an index to every element of a list or other collection:

In [None]:
for index, name in enumerate(names):
    print(str(index) , " " , name, " is in the list.")
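
By default `enumerate` counts from 0. If you want human-friendly numbering, its `start` parameter changes the first index. A short sketch with a made-up list called `sample_names`:

```python
sample_names = ["Johannes", "Giovanni", "Rose"]

# start=1 makes the numbering begin at 1 instead of 0
numbered = [f"{position}. {name}" for position, name in enumerate(sample_names, start=1)]
print(numbered)  # ['1. Johannes', '2. Giovanni', '3. Rose']
```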

## Searching and editing

In [None]:
names = ["Johannes", "Giovanni", "Rose", "Yongzhe", "Luciana", "Imani"]

# Finding an element
print(names.index("Johannes"))

# Adding an element
names.append("Kumiko")

# Adding an element at a specific location
names.insert(2, "Roberta")

print(names)

#Removal
fruits = ["apple","orange","pear"]
del fruits[0]
fruits.remove("pear")
print('Fruits: ', fruits)

# Modifying an element
names[5] = "Tom"
print(names)

# Test whether an item is in the list (best do this before removing to avoid raising errors)
print("Tom" in names)

# Length of a list
print("Length of the list: " + str(len(names)))

Remember: Python starts counting at 0!

## Sorting and copying

In [None]:
# Temporary sorting:
print(sorted(names))
print(names)

# Make changes permanent
names.sort()
print("Sorted names: " + str(names))
names.sort(reverse=True)
print("Reverse sorted names: " + str(names))

In [None]:
# Assigning a list to a new name does not copy it: both names refer to the same list
namez = names
namez.remove("Johannes")
print(namez)
print(names)

# Now a real copy (.copy() makes a shallow copy, which is enough for a list of strings)
print("After copy")

namez = names.copy()
namez.remove("Giovanni")
print(namez)
print(names)

#Alternative
namez = names[:]
print(namez)
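
The difference between a shallow and a deep copy only shows up when a list contains other mutable objects, such as nested lists. The sketch below uses the standard-library `copy` module; the nested list `groups` is made-up example data.

```python
import copy

groups = [["Rose", "Tom"], ["Imani"]]

shallow = groups.copy()        # new outer list, but the inner lists are shared
deep = copy.deepcopy(groups)   # new outer list and new inner lists

groups[0].append("Kumiko")     # change one of the inner lists

print(shallow[0])  # ['Rose', 'Tom', 'Kumiko'] (shared inner list changed)
print(deep[0])     # ['Rose', 'Tom'] (fully independent)
```

For flat lists of strings or numbers, `.copy()` or `[:]` is sufficient; `copy.deepcopy` is worth reaching for when the elements themselves can be modified.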

## Strings as lists

Strings can be indexed, sliced, and searched just like lists (although, unlike lists, they cannot be modified in place). This is especially handy in text mining:

In [None]:
course = "Predictive analytics"
print("Last nine letters: "+course[-9:])
print("Analytics in course title? " + str("analytics" in course))
print("Start location of 'analytics': " + str(course.find("analytics")))
print(course.replace("analytics","analysis"))
list_of_words = course.split(" ")
for index, word in enumerate(list_of_words):
    print("Word ", index, ": "+word)

## Sets

Sets only contain unique elements. They can be created with `set()` or with curly braces, and support operations such as `intersection()`:

In [None]:
name_set = set(names)
print(name_set)

# Add an element
name_set.add("Galina")
print(name_set)

# Discard an element
name_set.discard("Johannes")
print(name_set)

name_set2 = set(["Rose", "Tom"])
# Difference and intersection
difference = name_set - name_set2
print(difference)
intersection = name_set.intersection(name_set2)
print(intersection)
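
Sets also support operator shorthand for the common operations. A minimal sketch with two made-up sets, `set_a` and `set_b`:

```python
set_a = {"Rose", "Tom", "Imani"}   # curly braces create a set directly
set_b = {"Rose", "Galina"}

union = set_a | set_b       # elements in either set
common = set_a & set_b      # same as set_a.intersection(set_b)
only_one = set_a ^ set_b    # elements in exactly one of the two sets

print(union)
print(common)
print(only_one)
```

Note that sets are unordered, so the printed order of the elements can differ between runs.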

# Dictionary & Function

## Dictionaries

Dictionaries are a great way to store particular data as key-value pairs, which mimics the basic structure of a simple database.

In [None]:
courses = {"Johannes" : "Predictive analytics", "Kumiko" : "Prescriptive analytics", "Luciana" : "Descriptive analytics"}

for organizer in courses:
    print(organizer + " teaches " + courses[organizer])

We can also write:

In [None]:
for organizer, course in courses.items():
    print(organizer + " teaches " + course)

In [None]:
# Adding items
courses["Imani"] = "Other analytics"
print(courses)

# Overwrite
courses["Johannes"] = "Business analytics"
print(courses)

In [None]:
# Remove
del courses["Johannes"]
print(courses)

In [None]:
# Looping values
for course in courses.values():
    print(course)

In [None]:
# Sorted output (on keys)
for organizer, course in sorted(courses.items()):
    print(organizer +" teaches " + course)
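
Looking up a key that is not in the dictionary (for example one that was deleted) with square brackets raises a `KeyError`. The `get` method and the `in` operator avoid that. A small sketch; the dictionary `teaching` is made up for this example:

```python
teaching = {"Kumiko": "Prescriptive analytics", "Luciana": "Descriptive analytics"}

found = teaching.get("Kumiko")                          # value for an existing key
missing = teaching.get("Johannes", "no course listed")  # fallback for a missing key
has_luciana = "Luciana" in teaching                     # membership test on the keys

print(found)
print(missing)
print(has_luciana)
```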

## Functions

Functions form the backbone of all code. You have already used some, like print(). You can easily define your own as well.

In [None]:
def my_function(a, b):
    a = a.title()
    b = b.upper()
    print(a+ " "+b)

In [None]:
def my_function2(a, b):
    a = a.title()
    b = b.upper()
    return a + " " + b

In [None]:
my_function("johannes","de smedt")
output = my_function2("johannes","de smedt")
print(output)

Notice how the first function already prints, while the second returns a string we have to print ourselves. Because Python is dynamically typed, a function can return values of different types depending on its input, like in this example:

In [None]:
# Different output type
def calculate_mean(a, b):
    if a > 0:
        return (a+b)/2
    else:
        return "a is not positive"

output = calculate_mean(1,2)
print(output)
output = calculate_mean(0,1)
print(output)
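
Returning a number in one case and a string in another makes the result awkward to use, because the caller has to check the type every time. One common alternative, sketched here under the assumption that a non-positive `a` should count as an error, is to raise an exception instead. The function name `mean_of_positive` is made up for this example.

```python
def mean_of_positive(a, b):
    """Return the mean of a and b; raise ValueError if a is not positive."""
    if a <= 0:
        raise ValueError("a must be positive")
    return (a + b) / 2

result = mean_of_positive(1, 2)
print(result)  # 1.5

# The caller decides how to react to the error
try:
    mean_of_positive(0, 1)
except ValueError as error:
    error_message = str(error)
    print("Could not compute mean:", error_message)
```

This way the function always returns a number when it returns at all.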

## Comprehensions

Comprehensions allow you to quickly/efficiently write lists/dictionaries:

In [None]:
# Finding even numbers
evens = [i for i in range(1,11) if i % 2 ==0]
print(evens)

In Python, you can easily make tuples such as pairs, like here:

In [None]:
# Double fun
pairs = [(x,y) for x in range(1,11) for y in range(5,11) if x>y]
print(pairs)

They are also useful to perform some pre-processing, e.g., on strings:

In [None]:
# Operations
names = ["jamal", "maurizio", "johannes"]

titled_names = [name.title() for name in names]
print(titled_names)

j_s = [name.title() for name in names if name.lower()[0] == 'j']
print(j_s)
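
The same syntax works for dictionaries and sets by using curly braces. A minimal sketch; `sample_names`, `name_lengths`, and `first_letters` are names chosen only for this example:

```python
sample_names = ["jamal", "maurizio", "johannes"]

# Dictionary comprehension: key and value separated by a colon
name_lengths = {name.title(): len(name) for name in sample_names}
print(name_lengths)  # {'Jamal': 5, 'Maurizio': 8, 'Johannes': 8}

# Set comprehension: duplicates are dropped automatically
first_letters = {name[0] for name in sample_names}
print(first_letters)
```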

# IO & Library

## Reading files

In Python, we can easily open any file type. Naturally, it is most suitable for plainly structured formats such as .txt, .csv, and so on. You can also open Excel files with appropriate packages, such as pandas (more on this later). Let's read in a .csv file:

In [None]:
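# Open a file for reading ('r'); the with-block closes it automatically
with open('data/DM_1.csv','r') as file:
    for line in file:
        # Each line already ends with a newline, so tell print not to add another
        print(line, end='')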

We can store this information in objects and start using it:

In [None]:
# The first loop consumed the file, so we open it again
with open('data/DM_1.csv','r') as file:
    # Skip the header row
    next(file)

    # Store names with amounts (columns 1 and 4, i.e. indices 0 and 3)
    amount_per_person = {}
    for line in file:
        cells = line.split(",")
        amount_per_person[cells[0]] = int(cells[3])

for person, amount in sorted(amount_per_person.items()):
    if amount > 25000:
        print(person , " has " , amount)

In [None]:
# Now we use 'w' for write
with open('data/ordered_amounts_per_person.csv','w') as output_file:
    for person, amount in sorted(amount_per_person.items()):
        # End each record with a newline so every person gets their own row
        output_file.write(person.lower()+","+str(amount)+"\n")
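
A `with` block closes the file automatically, and the standard-library `csv` module takes care of splitting and quoting the cells. The sketch below is self-contained: it assumes made-up rows and writes to a temporary folder rather than the course's `data/` directory.

```python
import csv
import os
import tempfile

rows = [("alice", 30000), ("bob", 12000)]  # made-up example data

# Write to a temporary folder so no real files are touched
path = os.path.join(tempfile.mkdtemp(), "amounts.csv")

with open(path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["name", "amount"])   # header row
    writer.writerows(rows)

# The file is closed automatically at the end of each with-block
with open(path, "r", newline="") as handle:
    reader = csv.DictReader(handle)       # uses the header row as keys
    amounts = {row["name"]: int(row["amount"]) for row in reader}

print(amounts)  # {'alice': 30000, 'bob': 12000}
```

Using `csv` instead of `line.split(",")` also handles cells that themselves contain commas.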

## Libraries

Libraries are imported by using `import`:

In [None]:
import numpy
import pandas
import sklearn

If you haven't installed scikit-learn (imported as `sklearn`) yet, install it with:

In [None]:
!pip install scikit-learn

We can import just a few bits using `from`, or create aliases using `as`:

In [None]:
import math as m
from math import pi

In [None]:
print(numpy.add(1, 2))
print(pi)
print(m.sin(1))

In the next part, some basic procedures that exist in NumPy, pandas, and scikit-learn are covered. This only scratches the surface of the possibilities, and many other functions will be used later on. Make sure to explore the possibilities yourself, and get a grasp of how the modules are called and used. Let's import them in this notebook to start with:

In [None]:
import numpy as np
import pandas as pd
import sklearn

## Numpy

In [None]:
# Create 'empty' arrays/matrices (np.zeros fills them with zeros)
empty_array = np.zeros(5)

empty_matrix = np.zeros((5,2))

print('Empty array: \n',empty_array)
print('Empty matrix: \n',empty_matrix)

In [None]:
# Create matrices
mat = np.array([[1,2,3],[4,5,6]])
print('Matrix: \n', mat)
print('Transpose: \n', mat.T)
print('Item 2,2: ', mat[1,1])
print('Item 2,3: ', mat[1,2])
print('rows and columns: ', np.shape(mat))
print('Sum total matrix: ', np.sum(mat))
print('Sum row 1: ' , np.sum(mat[0]))
print('Sum row 2: ', np.sum(mat[1]))
print('Sum column 3: ', np.sum(mat,axis=0)[2])
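
The `axis` argument decides the direction of an aggregation: `axis=0` collapses the rows (one result per column), and `axis=1` collapses the columns (one result per row). A minimal sketch with a made-up array called `grid`:

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

column_sums = grid.sum(axis=0)   # one sum per column: [5 7 9]
row_sums = grid.sum(axis=1)      # one sum per row: [ 6 15]
doubled = grid * 2               # arithmetic is applied element by element

print(column_sums)
print(row_sums)
print(doubled)
```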

## pandas

### Creating dataframes

pandas is great for reading and creating datasets, as well as performing basic operations on them.

In [None]:
# Creating a matrix with three rows of data
data = [['johannes',10], ['giovanni',2], ['john',3]]

# Creating and printing a pandas DataFrame object from the matrix
df = pd.DataFrame(data)
print(df)

In [None]:
# Naming the columns of the DataFrame object
df.columns = ['names', 'years']
print(df)

In [None]:
df_2 = pd.DataFrame(data = data, columns = ['names', 'years'])
print(df_2)

In [None]:
# Taking out a single column and calculating its sum
# This also shows the type of the variable: a 64 bit integer (array)
print(df['years'])
print('Sum of all values in column: ', df['years'].sum())

In [None]:
# Creating a larger matrix
data = [['johannes',10], ['giovanni',2], ['john',3], ['giovanni',2], ['john',3], ['giovanni',2], ['john',3], ['giovanni',2], ['john',3], ['johannes',10]]

# Again, creating a DataFrame object, now with columns
df = pd.DataFrame(data, columns = ['names','years'])

# Print the 5 first (head) and 5 last (tail) observations
print(df.head())
print('\n')
print(df.tail())

### Reading files

pandas can also read files directly into a DataFrame:

In [None]:
dataset = pd.read_csv('data/DM_1.csv')
print(dataset.head())

### Using dataframes

In [None]:
# Print all unique values of the column names
print(df['names'].unique())

In [None]:
# Print all values and their frequency:
print(df['names'].value_counts())
print(df['years'].value_counts())

In [None]:
# Add a column named 'code' with all zeros
df['code'] = np.zeros(10)
print(df)

You can also easily find things in a DataFrame using `.loc`:

In [None]:
# Rows with labels 2 to 5 (both included) and all columns:
print(df.loc[2:5, :])
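
`.loc` selects by label and includes the end label, whereas `.iloc` selects by position and excludes the end, like ordinary Python slicing. `.loc` also accepts a boolean condition. A small sketch on a made-up DataFrame called `people`:

```python
import pandas as pd

people = pd.DataFrame(
    {"names": ["johannes", "giovanni", "john", "giovanni"], "years": [10, 2, 3, 2]}
)

by_label = people.loc[1:2, :]       # labels 1 and 2 (end label included)
by_position = people.iloc[1:2, :]   # position 1 only (end position excluded)

# Boolean filtering: names of everyone with more than 2 years
experienced = people.loc[people["years"] > 2, "names"]

print(by_label)
print(by_position)
print(experienced.tolist())  # ['johannes', 'john']
```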

In [None]:
# Looping columns
for variable in df.columns:
    print(df[variable])

In [None]:
# Looping columns and obtaining the values (which returns an array)
for variable in df.columns:
    print(df[variable].values)

### Preparing datasets

In [None]:
dataset_1 = pd.read_csv('data/DM_1.csv', encoding='latin1')
dataset_2 = pd.read_csv('data/DM_2.csv', encoding='latin1')

In [None]:
dataset_1

In [None]:
dataset_2

In [None]:
dataset_2.columns = ['First name', 'Last name', 'Days active']
dataset_2

We can convert the second dataset to only have 1 column for names:

In [None]:
# .title() capitalises the first letter of each word and lowercases the rest
names = [dataset_2.loc[i,'First name'] + " " + dataset_2.loc[i,'Last name'].title() for i in range(0, len(dataset_2))]

# Make a new column for the name
dataset_2['Name'] = names

# Remove the old columns
dataset_2 = dataset_2.drop(['First name', 'Last name'], axis=1)
dataset_2

### Bringing together the datasets

Now that the datasets are compatible, we can merge them in a few different ways.

In [None]:
# A left join starts from the left dataset, in this case dataset_1, and for every row matches the value in the 
# column used for joining. As you will see, the result has 22 rows since some names appear multiple times in 
# the second dataset dataset_2.

both = pd.merge(dataset_1, dataset_2, on='Name', how='left')
both

In [None]:
# A right join does the opposite: now, dataset_2 is used to match all names with the corresponding 
# observations in dataset_1. There are as many observations as there are in dataset_2, as the rows 
# in dataset_1 are unique. The last row cannot be matched with any observation in dataset_1.

both = pd.merge(dataset_1, dataset_2, on='Name', how='right')
both

In [None]:
# Inner and outer join
# It is also possible to retain only the rows that are matched in both tables (inner join),
# or to keep every row from both tables whether it has a match or not (outer join).

both = pd.merge(dataset_1, dataset_2, on='Name', how='inner')
both

Notice how observation 12 is missing, as there is no corresponding value in `dataset_1`.

In [None]:
both = pd.merge(dataset_1, dataset_2, on='Name', how='outer')
both

In the last table, we have 23 rows, as both matching and non-matching values are returned.

Merging datasets can be really helpful. This code should give you ample ideas on how to do this quickly yourself. As always, there are a number of ways of achieving the same result. Don't hesitate to explore other solutions that might be quicker or easier.
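
To see how the four join types differ without needing the course files, here is a self-contained sketch on two tiny made-up tables, `left` and `right`. The `indicator=True` option of `pd.merge` adds a `_merge` column that records where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"Name": ["Ann", "Ben", "Cy"], "Amount": [10, 20, 30]})
right = pd.DataFrame({"Name": ["Ann", "Ann", "Dee"], "Days active": [5, 7, 2]})

# Number of rows produced by each join type
row_counts = {
    how: len(pd.merge(left, right, on="Name", how=how))
    for how in ["left", "right", "inner", "outer"]
}
print(row_counts)  # {'left': 4, 'right': 3, 'inner': 2, 'outer': 5}

# indicator=True labels each row as 'both', 'left_only', or 'right_only'
outer = pd.merge(left, right, on="Name", how="outer", indicator=True)
right_only_names = outer.loc[outer["_merge"] == "right_only", "Name"].tolist()
print(right_only_names)  # ['Dee']
```

"Ann" appears twice in `right`, so she produces two rows in every join that includes her, which is the same effect that increases the row count in the left join above.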

# scikit-learn

scikit-learn covers most common machine learning tasks and also ships with example datasets. Here, we will load one of them and fit a simple linear regression.

In [None]:
from sklearn import datasets as ds

In [None]:
# Load the Iris dataset
dataset = ds.load_iris()

# It behaves like a dictionary, see the keys for details:
print(dataset.keys())

In [None]:
# The 'DESCR' key holds a description text for the whole dataset
print(dataset['DESCR'])

In [None]:
# The data (independent variables) are stored under the 'data' key
# The names of the independent variables are stored in the 'feature_names' key
# Let's use them to create a DataFrame object:
df = pd.DataFrame(data=dataset['data'], columns=dataset['feature_names'])
print(df.head())

In [None]:
# The dependent variable is stored separately
df_y = pd.DataFrame(data=dataset['target'], columns=['target'])
print(df_y.head())

In [None]:
# Now, let's build a linear regression model
# (the Iris target is a class label, so this only illustrates the API)
from sklearn.linear_model import LinearRegression as LR

# First we create a linear regression object
regression = LR()

# Then, we fit the independent and dependent data
regression.fit(df, df_y)

# We can obtain the R^2 score (more on this later)
print(regression.score(df, df_y))

Very often, we need to perform an operation on a single observation. In that case, we have to reshape the data using numpy:

In [None]:
# Consider a single observation 
so = df.loc[2, :]
print(so)

# Just the values of the observation without meta data
print(so.values)

# Reshaping yields a new matrix with one row with as many columns as the original observation (indicated by the -1)
print(np.reshape(so.values, (1, -1)))

In [None]:
# For two observations:
so_2 = df.loc[2:3, :]
print(np.reshape(so_2.values, (2, -1)))
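
The reason for reshaping is that scikit-learn estimators expect a two-dimensional input with one row per observation, even when there is only one observation. A minimal, self-contained sketch, assuming the same Iris data and linear regression as above (the names `iris`, `model`, `single`, and `single_2d` are chosen only for this example):

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression

iris = datasets.load_iris()
model = LinearRegression().fit(iris["data"], iris["target"])

single = iris["data"][2]                  # one observation, shape (4,)
single_2d = np.reshape(single, (1, -1))   # one row, four columns: shape (1, 4)

prediction = model.predict(single_2d)     # one prediction, shape (1,)
print(single.shape, single_2d.shape, prediction.shape)
```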

This concludes our quick run-through of some basic functionality of the modules. Later on, we will use more and more specialized functions and objects, but for now this allows you to play around with data already.

# Visualisation

Visualisations often need a few tricks and extra lines of code to look good. This can be confusing at first, but it becomes more intuitive once you get the hang of the general ideas. We will be working mostly with Matplotlib (often imported as plt), NumPy (np), and pandas (pd). Matplotlib and pandas often offer similar solutions, but one can be slightly more convenient than the other depending on the situation. Make sure to look up some of the alternatives, as they might make more sense to you.

In [None]:
# First, we need to import our packages
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Pie and bar chart

In [None]:
# Data to plot
labels = 'classification', 'regression', 'time series'
sizes = [10, 22, 2]

colors = ['lightblue', 'lightgreen', 'pink']

# Allows us to highlight a certain piece of the pie chart
explode = (0.1, 0, 0)  
 
# Plot a pie chart with the pie() function. Notice how various parameters are given for coloring, labels, etc.
# They should be relatively self-explanatory
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
 
# This function makes the axes equal, so the circle is round
plt.axis('equal')

# Add a title to the plot
plt.title("Pie chart of modelling techniques")

# Finally, show the plot
plt.show()

Adding a legend:

In [None]:
patches, texts = plt.pie(sizes, colors=colors, shadow=True, startangle=90)
plt.legend(patches, labels, loc="best")
plt.axis('equal')
plt.title("Pie chart of modelling techniques")
plt.show()

In [None]:
# Bar charts are relatively similar. Here we use the bar() function
plt.bar(labels, sizes, align='center')
plt.xticks(labels)
plt.ylabel('#use cases')
plt.title('Bar chart of modelling technique')
plt.show()

## Histogram

In [None]:
# This function plots a histogram, with the 'data' object providing the data
# bins are calculated automatically, as indicated by the 'auto' option, which makes them relatively balanced and
# sets appropriate boundaries
# color sets the color of the bars
# rwidth makes the bars slightly narrower than the bins to leave space between the bars
data = np.random.normal(10, 2, 1000)
plt.hist(x= data, bins='auto', color='#008000', rwidth=0.85)

# For more information on colour codes, please visit: https://htmlcolorcodes.com/

# Additionally, some options are added:

# This option sets the grid of the plot to follow the values on the y-axis
plt.grid(axis='y')

# Adds a label to the x-axis
plt.xlabel('Value')

# Adds a label to the y-axis
plt.ylabel('Frequency')

# Adds a title to the plot
plt.title('Histogram of x')

# Makes the plot visible in the program
plt.show()

In [None]:
# Here, a different color and manually-specified bins are used
plt.hist(x= data, bins=[0,1,2,3,4,5,6,7,8,9,10], color='olive', rwidth=0.85)
plt.grid(axis='y')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of x (bins from 0 to 10)')
plt.show()

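See how we cut the tail off the distribution: the bins stop at 10, which is the mean of the data, so roughly the upper half of the values is left out.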

In [None]:
# Now, let's build a histogram with randomly generated data that follows a normal distribution
# Mean = 10, stddev = 15, sample size = 1,000
# More on random numbers will follow in module 2
s = np.random.normal(10, 15, 1000)

plt.hist(x=s, bins='auto', color='#008000', rwidth=0.85)
plt.grid(axis='y')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Histogram of x')
plt.show()

## Boxplot

In [None]:
# Boxplots are even easier. We can just use the boxplot() function without many parameters
# Here we call Matplotlib directly; the pandas version, which relies on Matplotlib in the background, follows further below
# We now use subplots.
data = [3,8,3,4,1,7,5,3,8,2,7,3,1,6,10,10,3,6,5,10]
# Subplot with 1 row, 2 columns, here we add figure 1 of 2 (first row, first column)
plt.subplot(1,2,1)   
plt.boxplot(data)

data_2 = [3,8,3,4,1,7,5,3,8,2,7,3,1,6,10,10,3,6,5,10, 99,87,45,-20]
# Here we add figure 2 of 2, hence it will be positioned in the second column of the first row
plt.subplot(1,2,2)   
plt.boxplot(data_2)
plt.show()

Boxplot for multiple variables:

In [None]:
# Generate 3 columns with 10 observations
df = pd.DataFrame(data = np.random.random(size=(10,3)), columns = ['class.','reg.','time series'])
print(df)

boxplot = df.boxplot()
plt.title('Triple boxplot')
plt.show()

df = pd.DataFrame(data = np.random.random(size=(10,3)), columns = ['class.','reg.','time series'])
df['number_of_runs'] = [0,0,0,1,1,2,0,1,2,0]

boxplot = df.boxplot(by='number_of_runs')
plt.show()

## Scatterplot

In [None]:
# We load the data again
x = [3,8,3,4,1,7,5,3,8,2,7,3,1,6,10,10,3,6,5,10]
y = [10,7,2,7,5,4,2,3,4,1,5,7,8,4,10,2,3,4,5,6]

# Here, we build a simple scatterplot of the two variables
plt.scatter(x,y)
plt.xlabel('x')
plt.ylabel('y')
plt.title('Simple scatterplot')
plt.show()

Hard to tell which variable is what, but it gives an overall impression of the data.

In [None]:
# A simple line plot

# We use the plot function for this. 'o-' indicates we want to use circles for markers and connect them with lines
plt.plot(x,'o-',color='blue')

# Here we use 'x--' for cross-shaped markers connected with intermittent lines
plt.plot(y,'x--',color='red')
plt.xlabel('Time')
plt.ylabel('Value')
plt.title("x and y over time")

# This function sets the range limits for the x axis at 0 and 20
plt.xlim(0,20)

# Adding a grid
plt.grid(True)

# Adding ticks on the x and y axes. The x ticks run from 0 to 20 and the y ticks from 0 to 10
# (the end of a range is not included, hence we use 21 and 11)
# We use steps of 4 for the x axis, and 2 for the y axis
plt.xticks(range(0,21,4))
plt.yticks(range(0,11,2))

plt.show()

---

# ⚠️ Common Pitfalls & Best Practices

As you work with Python, watch out for these common mistakes:

## 1. Index Starts at 0
Python lists, strings, and arrays start counting at **0**, not 1.
```python
# ❌ Wrong assumption
names = ["Alice", "Bob", "Charlie"]
# names[1] is NOT "Alice", it's "Bob"

# ✓ Correct
first_name = names[0]  # "Alice"
```

## 2. Indentation Matters
Python uses indentation (spaces/tabs) to define code blocks. Be consistent!
```python
# ❌ Wrong - IndentationError
if x > 5:
print("Greater")  # Missing indentation

# ✓ Correct
if x > 5:
    print("Greater")  # 4 spaces indentation
```

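## 3. Assignment vs Copy
Assigning a list to another variable creates a second reference to the same list, not a copy. `.copy()` and `[:]` make a shallow copy, which is independent as long as the list holds no nested mutable objects.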
```python
# ❌ Wrong - both variables point to same list
list1 = [1, 2, 3]
list2 = list1
list2.append(4)
# list1 is now [1, 2, 3, 4] too!

# ✓ Correct - create a new copy
list1 = [1, 2, 3]
list2 = list1.copy()  # or list1[:]
list2.append(4)
# list1 remains [1, 2, 3]
```

## 4. Type Mismatches
You cannot directly concatenate strings and numbers.
```python
# ❌ Wrong - TypeError
age = 25
print("I am " + age)

# ✓ Correct - convert to string first
print("I am " + str(age))
# or use f-strings (recommended)
print(f"I am {age}")
```

## 5. Mutable Default Arguments
Don't use mutable objects (like lists) as default function arguments.
```python
# ❌ Wrong - list persists across function calls
def add_item(item, my_list=[]):
    my_list.append(item)
    return my_list

# ✓ Correct - use None as default
def add_item(item, my_list=None):
    if my_list is None:
        my_list = []
    my_list.append(item)
    return my_list
```
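
To see why the first version is a problem, the sketch below calls it twice. The function is renamed `add_item_shared` here only to keep it apart from the corrected version; the behaviour is the same as the "wrong" example.

```python
def add_item_shared(item, my_list=[]):   # the default list is created only once
    my_list.append(item)
    return my_list

first = add_item_shared("a")
second = add_item_shared("b")   # reuses the very same default list

print(second)           # ['a', 'b']
print(first is second)  # True: both names refer to one list
```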

## 6. Division in Python 3
Be aware of integer vs float division.
```python
# In Python 3:
5 / 2   # = 2.5 (float division)
5 // 2  # = 2 (floor division, rounds down)
5 % 2   # = 1 (modulo/remainder)
```

## 7. Variable Scope
Variables defined inside functions are local and don't affect global variables with the same name.
```python
x = 10

def my_function():
    x = 5  # This is a local variable
    print(x)  # Prints 5

my_function()
print(x)  # Prints 10 (global x unchanged)
```

## 8. DataFrame Column Names with Spaces
When accessing pandas columns with spaces, use brackets, not dot notation.
```python
df = pd.DataFrame({'user id': [1, 2], 'page views': [100, 200]})

# ❌ Wrong - SyntaxError
# df.page views

# ✓ Correct
df['page views']
```

---

## 💡 Best Practices

1. **Use meaningful variable names**: `user_count` instead of `x`
2. **Comment your code**: Explain WHY, not WHAT
3. **Follow PEP 8**: Python's style guide (4 spaces for indentation)
4. **Use f-strings** for string formatting (Python 3.6+)
5. **Handle errors gracefully**: Use try/except blocks for robust code
6. **Test incrementally**: Run code frequently to catch errors early
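
As an illustration of point 5, here is a minimal sketch of a `try`/`except` block around a conversion that can fail. The helper `parse_amount` and the input values are made up for this example.

```python
def parse_amount(text):
    """Convert text to an int, returning None when it is not a valid integer."""
    try:
        return int(text)
    except ValueError:
        return None

parsed = [parse_amount(cell) for cell in ["25000", "n/a", " 42 "]]
print(parsed)  # [25000, None, 42]
```

Catching only the specific exception you expect (`ValueError` here) keeps other, unexpected errors visible.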

---

# 🎓 Week 0 Summary

Congratulations! You've completed the Python fundamentals. Here's what you've learned:

## Key Takeaways

### Basic Python
- ✓ Variables and data types (strings, numbers)
- ✓ Control flow (if/elif/else, for, while)
- ✓ Data structures (lists, tuples, sets, dictionaries)
- ✓ Functions and comprehensions

### Data Analysis
- ✓ **NumPy**: Arrays and numerical operations
- ✓ **Pandas**: DataFrames, reading CSV files, merging datasets
- ✓ **File I/O**: Reading and writing data files

### Visualization
- ✓ **Matplotlib**: Pie charts, bar charts, histograms, boxplots, scatterplots
- ✓ Customizing plots with colors, labels, and titles

### Machine Learning Basics
- ✓ **scikit-learn**: Loading datasets, building simple models

---

## Next Steps

Now that you have Python fundamentals, you're ready for:

- **Week 1**: Web scraping with BeautifulSoup and Selenium
- **Week 2**: LLM-based scraping and PageRank algorithms
- **Week 3**: Network analysis with NetworkX
- **Week 4**: Clustering and recommendation systems

---

## 📚 Additional Resources

- [Python Official Documentation](https://docs.python.org/3/)
- [NumPy Documentation](https://numpy.org/doc/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Matplotlib Gallery](https://matplotlib.org/stable/gallery/)
- [Scikit-learn Tutorials](https://scikit-learn.org/stable/tutorial/)

---

## ✅ Practice Time!

Head over to `Week0-Exercise.ipynb` to test your knowledge with hands-on exercises!