## Importing pandas

### Getting started and checking your pandas setup

Difficulty: *easy* 

**1.** Import pandas under the alias `pd`.

**2.** Print the version of pandas that has been imported.

**3.** Print out all the *version* information of the libraries that are required by the pandas library.

## DataFrame basics

### A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames

Difficulty: *easy*

Note: remember to import numpy using:
```python
import numpy as np
```

Consider the following Python dictionary `data` and Python list `labels`:

``` python
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
```
(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

**4.** Create a DataFrame `df` from this dictionary `data` which has the index `labels`.

In [None]:
import numpy as np

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = # (complete this line of code)

**5.** Display a summary of the basic information about this DataFrame and its data (*hint: there is a single method that can be called on the DataFrame*).

**6.** Return the first 3 rows of the DataFrame `df`.

**7.** Select just the 'animal' and 'age' columns from the DataFrame `df`.

**8.** Select the data in rows `[3, 4, 8]` *and* in columns `['animal', 'age']`.

**9.** Select only the rows where the number of visits is greater than 3.

**10.** Select the rows where the age is missing, i.e. it is `NaN`.

**11.** Select the rows where the animal is a cat *and* the age is less than 3.

**12.** Select the rows where the age is between 2 and 4 (inclusive).

**13.** Change the age in row 'f' to 1.5.

**14.** Calculate the sum of all visits in `df` (i.e. find the total number of visits).

**15.** Calculate the mean age for each different animal in `df`.

**16.** Append a new row 'k' to `df` with your choice of values for each column. Then delete that row to return the original DataFrame.

**17.** Count the number of each type of animal in `df`.

**18.** Sort `df` first by the values in the 'age' column in *descending* order, then by the values in the 'visits' column in *ascending* order (so row `i` should be first, and row `d` should be last).

**19.** The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be `True` and 'no' should be `False`.

**20.** In the 'animal' column, change the 'snake' entries to 'python'.

**21.** For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (*hint: use a pivot table*).

## DataFrames: beyond the basics

### Slightly trickier: you may need to combine two or more methods to get the right answer

Difficulty: *medium*

The previous section was a tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.

**22.** You have a DataFrame `df` with a column 'A' of integers. For example:
```python
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
```

How do you filter out rows which contain the same integer as the row immediately above?

You should be left with a column containing the following values:

```python
1, 2, 3, 4, 5, 6, 7
```

**23.** Given a DataFrame of numeric values, say
```python
df = pd.DataFrame(np.random.random(size=(5, 3))) # a 5x3 frame of float values
```

how do you subtract the row mean from each element in the row?

**24.** Suppose you have a DataFrame with 10 columns of real numbers, for example:

```python
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
```
Which column of numbers has the smallest sum?  Return that column's label.

**25.** How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)? As input, use a DataFrame of zeros and ones with 10 rows and 3 columns.

```python
df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
```

The next three puzzles are slightly harder.


**26.** In the cell below, you have a DataFrame `df` that consists of 10 columns of floating-point numbers. Exactly 5 entries in each row are NaN values. 

For each row of the DataFrame, find the *column* which contains the *third* NaN value.

You should return a Series of column labels: `e, c, d, h, d`

In [None]:
nan = np.nan

data = [[0.04,  nan,  nan, 0.25,  nan, 0.43, 0.71, 0.51,  nan,  nan],
        [ nan,  nan,  nan, 0.04, 0.76,  nan,  nan, 0.67, 0.76, 0.16],
        [ nan,  nan, 0.5 ,  nan, 0.31, 0.4 ,  nan,  nan, 0.24, 0.01],
        [0.49,  nan,  nan, 0.62, 0.73, 0.26, 0.85,  nan,  nan,  nan],
        [ nan,  nan, 0.41,  nan, 0.05,  nan, 0.61,  nan, 0.48, 0.68]]

columns = list('abcdefghij')

df = pd.DataFrame(data, columns=columns)

# write a solution to the question here

**27.** A DataFrame has a column of groups 'grps' and a column of integer values 'vals':

```python
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
```
For each *group*, find the sum of the three greatest values. You should end up with the answer as follows:
```
grps
a    409
b    156
c    345
```

In [None]:
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})

# write a solution to the question here

**28.** The DataFrame `df` constructed below has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). 

For each group of 10 consecutive integers in 'A' (i.e. `(0, 10]`, `(10, 20]`, ...), calculate the sum of the corresponding values in column 'B'.

The answer should be a Series as follows:

```
A
(0, 10]      635
(10, 20]     360
(20, 30]     315
(30, 40]     306
(40, 50]     750
(50, 60]     284
(60, 70]     424
(70, 80]     526
(80, 90]     835
(90, 100]    852
```

In [None]:
df = pd.DataFrame(np.random.RandomState(8765).randint(1, 101, size=(100, 2)), columns = ["A", "B"])

# write a solution to the question here

**29.** Show how you would handle missing values in the given DataFrame `df`, using the `dropna`, `fillna` and `interpolate` methods.

In [11]:
import pandas as pd

df = pd.DataFrame({"A": [1, None, 3], "B": [4, 5, None]})

# df_drop =  (complete this line of code)
# df_fill =  (complete this line of code)
# df_interp =  (complete this line of code)

**30.** On the given DataFrame, show a vectorized operation (`* 2`) and the equivalent `apply` call with a lambda function. Which is faster?

In [14]:
df = pd.DataFrame({"A": [1, None, 3], "B": [4, 5, None]})


#vectorized
df["C"] = df["A"] * 2  # Fast vectorization

#apply function
df["D"] = df["A"].apply(lambda x: x * 2)  # Slower than vectorized
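One way to probe the "which is faster?" question is to time both approaches on a larger column. The sketch below is only an illustration: the 100,000-row frame `big` and the repetition count are arbitrary assumptions, not part of the exercise, and the exact timings depend on the machine.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger frame so that the difference is measurable
big = pd.DataFrame({"A": np.arange(100_000, dtype="float64")})

# Both approaches compute the same values
vectorized = big["A"] * 2
applied = big["A"].apply(lambda x: x * 2)
print(vectorized.equals(applied))

# Time each approach over a few repetitions
t_vec = timeit.timeit(lambda: big["A"] * 2, number=10)
t_apply = timeit.timeit(lambda: big["A"].apply(lambda x: x * 2), number=10)
print(f"vectorized: {t_vec:.4f}s, apply: {t_apply:.4f}s")
```

The vectorized form is generally expected to win by a wide margin, because it runs in compiled code while `apply` calls the Python lambda once per element.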

**31.** Read a CSV file and process it in chunks using the `chunksize` parameter.
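A minimal sketch of chunked reading, assuming a small in-memory CSV as a stand-in for a large file. The column names `id` and `value` and the chunk size of 4 are made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for a large CSV file (hypothetical data)
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

total = 0
n_chunks = 0
# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["value"].sum()  # process each chunk independently
    n_chunks += 1

print(n_chunks, total)
```

Only one chunk is held in memory at a time, which is the point of the technique when the file does not fit in RAM.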


**32.** Use merge(), join(), and concat() to combine two provided dataframes.

In [16]:
a = pd.DataFrame({"id":[1,2],"A":[10,20]})
b = pd.DataFrame({"id":[1,2],"B":[30,40]})

# pd_merge = merge on the key column
# pd_join = join using the index
# pd_concat = append rows

**33.** Show knowledge of groupby(), multiple aggregations, and transform() functions.

In [21]:
df = pd.DataFrame({
    "team": ["A","A","B","B"],
    "score": [10,20,30,40],
    "time": [5,10,15,20]
})

# group by team; aggregate the score column with mean and the time column with sum
# df_multiple_group = aggregate dataframe

# add a score_mean column using the transform function
# df["score_mean"] = transform dataframe

**34.** Parse the string columns of the given DataFrame into proper types, following the comments in each cell:


In [32]:
df = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "age": ["25", "30", "35", "40"],
    "salary": ["50,000", "60,000", "70,000", "80,000"],
    "department": ["HR", "IT", "IT", "Finance"],
    "join_date": ["2021-01-15", "2020-06-01", "2019-09-23", "2018-03-10"],
    "active": ["True", "False", "True", "True"]
})
# Parsing means interpreting strings into proper types (numbers, dates, booleans).

# Parse join_date to datetime format
# df["join_date"] = parsing

In [26]:
#Parse salary column to numeric format
# df["salary"] = parsing

In [31]:
#convert age and id columns to int using astype, add error handling
# df_converted = your code

| Task                  | Best Tool         |
| --------------------- | ----------------- |
| Force dtype change    | `astype()`        |
| Clean numeric strings | `to_numeric()`    |
| Parse dates           | `to_datetime()`   |
| Memory optimization   | `category`        |
| Handle invalid values | `errors="coerce"` |
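The tools in the table can be sketched on a small frame. The frame `raw` and its columns are hypothetical and deliberately differ from the exercise data; they include a few invalid entries to show what `errors="coerce"` does.

```python
import pandas as pd

# Hypothetical raw data with a few invalid entries
raw = pd.DataFrame({
    "amount": ["1,200", "350", "n/a"],
    "when": ["2021-01-15", "not a date", "2019-09-23"],
    "count": ["1", "2", "3"],
})

# Strip thousands separators, then coerce anything unparsable to NaN
cleaned = raw["amount"].str.replace(",", "", regex=False)
amount = pd.to_numeric(cleaned, errors="coerce")

# Unparsable dates become NaT instead of raising
when = pd.to_datetime(raw["when"], errors="coerce")

# astype forces the dtype and raises ValueError on bad input
count = raw["count"].astype(int)

print(amount.tolist())
print(when.isna().tolist())
print(count.tolist())
```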

**35.** Convert the 'department' column to the categorical dtype.

In [33]:
df = pd.DataFrame({
    "id": ["1", "2", "3", "4"],
    "age": ["25", "30", "35", "40"],
    "salary": ["50,000", "60,000", "70,000", "80,000"],
    "department": ["HR", "IT", "IT", "Finance"],
    "join_date": ["2021-01-15", "2020-06-01", "2019-09-23", "2018-03-10"],
    "active": ["True", "False", "True", "True"]
})

# df["department"] = your code
df.dtypes
# show all categories
# df["department"].

id            object
age           object
salary        object
department    object
join_date     object
active        object
dtype: object

Categorical (category) dtype

Category is a special pandas dtype for columns with repeated, limited values.

Benefits:
- Reduced memory usage
- Faster comparisons and grouping
- Optional ordering of values
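The memory and ordering benefits can be checked on a small example. This is a sketch under assumptions: the Series `sizes` and its values are invented for illustration, and the exact byte counts vary with the pandas version.

```python
import pandas as pd

# Hypothetical column with many repeats of a few values
sizes = pd.Series(["small", "large", "medium"] * 1000)

as_object = sizes.memory_usage(deep=True)
as_category = sizes.astype("category").memory_usage(deep=True)
print(as_object, as_category)

# An ordered categorical supports comparisons that follow the given order
ordered = pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],
    ordered=True,
)
s = pd.Series(ordered)
print((s > "small").tolist())
```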

**36.** Index exercises

In [5]:
import pandas as pd

data = {
    "country": ["USA", "USA", "Canada", "Canada"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [100, 150, 90, 120],
    "profit": [30, 50, 25, 40]
}

df = pd.DataFrame(data)

# set the year column as the index
# df_index = your code

In [6]:
#reset index
# df_reset = your code

In [7]:
#set multi index using columns country and year
# df_multi = your code

In [8]:
#give example of selecting by multi index
# your code

Useful MultiIndex operations:

| Operation                       | Code                                                               |
| ------------------------------- | ------------------------------------------------------------------ |
| Swap multi-index levels         | `df_multi.swaplevel()`                                             |
| Remove one index level          | `df_multi.reset_index(level='year')`                               |
| Flatten MultiIndex column names | `df.columns = ['_'.join(col) for col in df.columns]` after groupby |
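The three operations in the table can be tried on a small frame. In this sketch the names `sales`, `multi`, `swapped`, `partial` and `agg` are illustrative choices, and the data only mirrors the shape of the exercise data.

```python
import pandas as pd

# Hypothetical data, similar in shape to the exercise above
sales = pd.DataFrame({
    "country": ["USA", "USA", "Canada", "Canada"],
    "year": [2022, 2023, 2022, 2023],
    "sales": [100, 150, 90, 120],
})
multi = sales.set_index(["country", "year"])

# Swap the two index levels: (country, year) -> (year, country)
swapped = multi.swaplevel()
print(list(swapped.index.names))

# Move only the 'year' level back into a regular column
partial = multi.reset_index(level="year")
print(partial.columns.tolist())

# Multiple aggregations produce MultiIndex columns; join them into flat names
agg = sales.groupby("country").agg({"sales": ["sum", "mean"]})
agg.columns = ["_".join(col) for col in agg.columns]
print(agg.columns.tolist())
```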

**37.** Handling duplicates

In [11]:
import pandas as pd

data = {
    "name": ["Alice", "Bob", "Charlie", "Bob", "Alice"],
    "email": [
        "alice@mail.com",
        "bob@mail.com",
        "charlie@mail.com",
        "bob@mail.com",     # duplicate
        "alice@mail.com"    # duplicate
    ],
    "age": [25, 30, 35, 31, 26]  # different values for duplicates
}

df = pd.DataFrame(data)

#detect duplicates on email column keep first 
# dup_mask = your code
# df[dup_mask]

In [12]:
#drop duplicates on email column keep first
# df_cleaned = your code 

**38.** Inverting a boolean mask

In [14]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'age': [25, np.nan, 40, np.nan, 31]
})

# show the inverted boolean mask of isna():
df['age'].isna()

0    False
1     True
2    False
3     True
4    False
Name: age, dtype: bool

**39.** Adding nan values randomly

In [17]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.random(size=(5, 3)))

# Create a boolean mask that is True with a probability of 20%
# mask = your code

# df[mask] = np.nan

**40.** Map refresher

In [19]:
import pandas as pd

# Sample Series
s = pd.Series([1, 2, 3, 4, 5])

# Use map to square each number in the Series s
# squared = your code

In [22]:
# use dictionary to switch string to boolean
s = pd.Series(['yes','yes','no','yes'])
# s_boolean = your code

**41.** Apply refresher

In [24]:
df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [10, 20, 30]
})

# Sum across columns for each row
# row_sum = your code

| Feature     | `map()`                                                       | `apply()`                  |
| ----------- | ------------------------------------------------------------- | -------------------------- |
| Works on    | **Series** (element-wise `DataFrame.map` exists since pandas 2.1) | **Series or DataFrame**    |
| Input       | dict, Series, or function                                     | function (very flexible)   |
| Output      | Series                                                        | Series or DataFrame        |
| Typical use | Element-wise value mapping                                    | Row/column-wise operations |
| Speed       | Usually faster                                                | Slower (more general)      |
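The differences in the table can be seen on a small example. The Series `s` and the frame `frame` below are hypothetical and chosen only to show each input style.

```python
import pandas as pd

s = pd.Series(["cat", "dog", "cat"])

# map with a dict: element-wise lookup; keys missing from the dict become NaN
sounds = s.map({"cat": "meow"})
print(sounds.tolist())

# map with a function: called once per element
lengths = s.map(len)
print(lengths.tolist())

frame = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# apply on a DataFrame: the function receives a whole column (axis=0) ...
col_range = frame.apply(lambda col: col.max() - col.min())
print(col_range.tolist())

# ... or a whole row (axis=1)
row_max = frame.apply(lambda row: row.max(), axis=1)
print(row_max.tolist())
```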



**42.** Filter refresher

In [26]:
# Sample DataFrame
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "David"],
    "age": [25, 30, 35, 40],
    "salary": [50000, 60000, 70000, 80000],
    "department": ["HR", "IT", "IT", "Finance"],
    "start_year": [2018, 2016, 2015, 2012]
})
# Filter columns containing 'a'
# filtered_cols = your code

In [30]:
# Filter specific columns (name and salary)
# df_filter = your code

In [31]:
# Use the regex parameter
# df_regex = your code

In [32]:
#Filter rows instead of columns
# df_rows = your code

**43.** Iterrows, index refresher, nunique

In [35]:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], columns=["A","B","C"])


# loop through the rows using the iterrows function; print the row number and the column A value

# your code

In [36]:
#return index as a list
# your code

In [44]:
#use nunique function to return number of unique values in each column
# s_nunique = your code

**44.** Reading CSV, Parquet and Excel files

In [48]:
#read csv from url
url = 'https://raw.githubusercontent.com/KeithGalli/complete-pandas-tutorial/refs/heads/master/warmup-data/coffee.csv'
# df_coffee = your code

In [49]:
#read csv from file
path = './pandas/data/coffee.csv'
# df_coffee = your code

CSV is a very popular format, but its drawback is that it takes up a lot of storage.\
Parquet, a compressed columnar binary format, is usually much smaller on disk.

In [51]:
#load parquet file
path = './pandas/data/results.parquet'
# df_results = your code

In [52]:
#load excel file
path = './pandas/data/olympics-data.xlsx'
# olympics_data = your code

In [53]:
#load specific excel sheet
path = './pandas/data/olympics-data.xlsx'
sheet = 'results'
# olympics_data = your code

In [54]:
#check shape of olympics_data df
# your code
#check size of olympics_data df
# your code

**45.** Sampling dataframe

In [55]:
import pandas as pd

coffee = pd.read_csv('./pandas/data/coffee.csv')

#return sample of coffee dataframe using seed
# df_coffee_sample = your code

**46.** Str, isin, query

In [57]:
import pandas as pd

bios = pd.read_csv('./pandas/data/bios.csv')
coffee = pd.read_csv('./pandas/data/coffee.csv')


# use str.contains to select the rows whose name contains mateusz or kamil
# bios_filtered = your code

In [58]:
#use isin method to select rows with born_country equal to UKR or POL
# bios_filtered = your code

In [59]:
#use 'query' method to select rows with born_country == "USA" and born_city=="Seattle"
# bios_query = your code

**47.** np.where method, dropping columns, copying dataframe

In [64]:
import pandas as pd
import numpy as np

coffee = pd.read_csv('./pandas/data/coffee.csv')
coffee['price'] = 4.99

#using np.where method add column new_price that will be 3.99 for Espresso and 5.99 for the rest
coffee['new_price'] = np.where(coffee['Coffee Type']=='Espresso',3.99,5.99)


In [66]:
#drop price column 
# coffee_filtered = your code

In [70]:
#rename column new_price to price
# your code

In [67]:
# plain assignment does not copy: coffee_new and coffee refer to the same DataFrame in memory
coffee_new = coffee
coffee_new['amount'] = 5
coffee.head()

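|     | Day       | Coffee Type | Units Sold | price | new_price | amount |
| --- | --------- | ----------- | ---------- | ----- | --------- | ------ |
| 0   | Monday    | Espresso    | 25         | 4.99  | 3.99      | 5      |
| 1   | Monday    | Latte       | 15         | 4.99  | 5.99      | 5      |
| 2   | Tuesday   | Espresso    | 30         | 4.99  | 3.99      | 5      |
| 3   | Tuesday   | Latte       | 20         | 4.99  | 5.99      | 5      |
| 4   | Wednesday | Espresso    | 35         | 4.99  | 3.99      | 5      |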


In [100]:
# to avoid this, use the copy method
# coffee_new = your code
# coffee_new['storage'] = 1
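The aliasing behaviour described above can be checked directly. The sketch uses a tiny hypothetical frame `original` rather than the coffee data.

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2, 3]})

# Plain assignment binds a second name to the same object
alias = original
alias["y"] = 0
print(alias is original, "y" in original.columns)

# .copy() creates an independent DataFrame
independent = original.copy()
independent["z"] = 1
print("z" in original.columns)
```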

**48.** str.split method

In [101]:
import pandas as pd
import numpy as np

bios = pd.read_csv('./pandas/data/bios.csv')

#use str.split to select just first name from name column
# bios['first_name'] = your code

**49.** dt methods

In [99]:
# create born_datetime column converting born_date to datetime format
# bios['born_datetime'] = your code

# do the same, but specify error handling
# bios['born_datetime'] = your code

In [97]:
# using the dt accessor, create a born_year column by extracting the year from the born_datetime column
# bios['born_year'] = your code


In [96]:
#create column that is showing if given born_datetime year is leap or not
# bios['leap_year'] = your code

**50.** Write a Python function that takes a row as its argument and, based on the height and weight conditions below (checked in order), categorizes the athlete as Lightweight, Middleweight or Heavyweight. Use the `apply` method to create the 'Category' column.

| Condition                              | Category     |
| -------------------------------------- | ------------ |
| `height_cm < 175` and `weight_kg < 70` | Lightweight  |
| `height_cm < 185` or `weight_kg < 80`  | Middleweight |
| otherwise                              | Heavyweight  |

In [95]:
def categorize_athlete(row):
  pass
  # your code
  
# bios['Category'] = your code

**51.** Merging data

In [112]:
bios = pd.read_csv('./pandas/data/bios.csv')
nocs = pd.read_csv('./pandas/data/noc_regions.csv')


#merge bios with nocs on born_country(bios) and NOC(nocs) - left join
# bios_new = your code

In [111]:
# concatenate the two provided dataframes (similar to SQL UNION ALL)
usa = bios[bios['born_country']=='USA'].copy()
gbr = bios[bios['born_country']=='GBR'].copy()

# new_df = your code

In [109]:
#merge results and bios on the same column ()
results = pd.read_parquet('./pandas/data/results.parquet')

# combined_df = your code

**52.** Pivot method

In [108]:
# pivot the given table so that the Coffee Type values become columns and Day becomes the index
coffee = pd.read_csv('https://raw.githubusercontent.com/KeithGalli/complete-pandas-tutorial/refs/heads/master/warmup-data/coffee.csv')
coffee['price'] = np.where(coffee['Coffee Type']=='Espresso',3.99,5.99)
coffee['revenue'] = coffee['price'] * coffee['Units Sold']

# pivot = your code

**53.** Rank method

In [107]:
# create a height_rank column that ranks each athlete by height
# bios['height_rank'] = your code

#sort bios using this created column
# bios_sorted = your code