# Introduction to Data Science - Review

- Review of the `os` library
- Review of the `pandas` library

## The `os` library
### Current Working Directory (CWD)

In [1]:
import os

# Using os.getcwd() to find the current working directory
current_directory = os.getcwd()

print(f"\nCurrent working directory: {current_directory}\n")


Current working directory: c:\Users\kevin\OneDrive - Hochschule Düsseldorf\MMI\Data Science\data_science_course\notebooks\Moodle notebooks



### Creating directories with `makedirs()`

In [None]:
# dataset_directory = 'dataset'
# os.makedirs(dataset_directory)

Calling `os.makedirs()` a second time for the same folder raises a `FileExistsError`:

In [None]:
# dataset_directory = 'dataset'
# os.makedirs(dataset_directory)

Best practice for this problem: check whether the folder already exists before creating it.

In [2]:
dataset_directory = 'dataset'
if not os.path.exists(dataset_directory):
    os.makedirs(dataset_directory)
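
As an alternative to the explicit existence check, `os.makedirs()` accepts an `exist_ok=True` argument. The following is a minimal sketch; it works inside a temporary directory (an assumption made here so the example leaves no files behind), and the folder name is only illustrative.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "dataset")  # illustrative folder name

    os.makedirs(target, exist_ok=True)  # creates the folder
    os.makedirs(target, exist_ok=True)  # second call does nothing, no FileExistsError

    created = os.path.isdir(target)

print(created)
```

Both variants avoid the error; `exist_ok=True` simply saves the separate `os.path.exists()` check.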

## Specifying paths

To create or open a file, we have to specify its exact path in a suitable way.

There are two ways to do this:
- Relative paths (relative to the "current" folder)
- Absolute paths (the full path, independent of the current location)

### Building and joining paths
In Python, paths are usually given as plain strings. Combining paths (and file names) by hand is a common source of errors, though. The classic one is the separator: `/` and `\\` both work in a string literal, whereas a single `\` is not reliable, because Python may interpret it as the start of an escape sequence.

Paths can be combined with `os.path.join()`, for example `os.path.join(path1, folder1, subfolder1, "my_file.txt")`.
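
The separator problem and `os.path.join()` can be illustrated with a short sketch. All folder and file names below are hypothetical and chosen only for demonstration.

```python
import os

# A single backslash starts an escape sequence in a normal string literal
broken = "data\new.csv"     # "\n" is turned into a newline character
escaped = "data\\new.csv"   # a doubled backslash gives one literal backslash
raw = r"data\new.csv"       # raw string: backslashes are kept as written

print("\n" in broken)       # True: the path silently contains a newline
print(escaped == raw)       # True: both spell the same path

# os.path.join inserts the separator of the current operating system
path = os.path.join("dataset", "raw", "my_file.txt")
print(path)
```

Because `os.path.join()` picks the separator itself, the same code runs unchanged on Windows, macOS, and Linux.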

### Relative paths

A relative path is a file path that is interpreted relative to the current working directory (CWD).
For example, if your working directory is '/home/user/project', the relative path 'dataset/data.csv' refers to '/home/user/project/dataset/data.csv'.

Here are some examples of relative paths, again with '/home/user/project' as the working directory:

- 'file.txt' -> '/home/user/project/file.txt'
- 'dataset/data.csv' -> '/home/user/project/dataset/data.csv'
- '../file.txt' -> '/home/user/file.txt'
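
The way a relative path is resolved can be reproduced in code. This sketch uses `posixpath` (the POSIX flavour of `os.path`) so that the result is the same on every operating system; the working directory '/home/user/project' is the assumed example value from the text, not a real folder.

```python
import os
import posixpath

cwd = "/home/user/project"  # assumed working directory, for illustration only

resolved_paths = {}
for relative in ["file.txt", "dataset/data.csv", "../file.txt"]:
    # Join with the working directory, then collapse ".." segments
    resolved_paths[relative] = posixpath.normpath(posixpath.join(cwd, relative))
    print(f"{relative!r} -> {resolved_paths[relative]!r}")

# For the real working directory, os.path.abspath does both steps at once
print(os.path.abspath("dataset/data.csv"))
```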

### Inspecting directories with `listdir()`

#### Creating empty files:
Here we create empty files simply by calling `open` in write mode. We also use an absolute path built with `os.path.join()`.

In [4]:
cwd = os.getcwd()
dataset_directory = "dataset"
filename = "sample"
filetype = ".csv"

for i in range(6):
    file = f"{filename}{i}{filetype}"
    path = os.path.join(cwd, dataset_directory, file)
    print(f"{i}. Path: {path}\n")

    with open(path, 'w') as creating_new_csv_file: 
        pass 

0. Path: c:\Users\kevin\OneDrive - Hochschule Düsseldorf\MMI\Data Science\data_science_course\notebooks\Moodle notebooks\dataset\sample0.csv

1. Path: c:\Users\kevin\OneDrive - Hochschule Düsseldorf\MMI\Data Science\data_science_course\notebooks\Moodle notebooks\dataset\sample1.csv

2. Path: c:\Users\kevin\OneDrive - Hochschule Düsseldorf\MMI\Data Science\data_science_course\notebooks\Moodle notebooks\dataset\sample2.csv

3. Path: c:\Users\kevin\OneDrive - Hochschule Düsseldorf\MMI\Data Science\data_science_course\notebooks\Moodle notebooks\dataset\sample3.csv

4. Path: c:\Users\kevin\OneDrive - Hochschule Düsseldorf\MMI\Data Science\data_science_course\notebooks\Moodle notebooks\dataset\sample4.csv

5. Path: c:\Users\kevin\OneDrive - Hochschule Düsseldorf\MMI\Data Science\data_science_course\notebooks\Moodle notebooks\dataset\sample5.csv



### Mini exercise: print all file names
- Loop over all file names in the folder with a for loop and print each one.
- Use `os.listdir(dataset_directory)` for this.

In [6]:
# Using os.listdir() to list the contents of a directory

print(f"\nContents of the '{dataset_directory}' directory:\n")

# Create a for loop and print all file names in the folder dataset_directory

for file in os.listdir(dataset_directory):
    print(file)


Contents of the 'dataset' directory:

sample0.csv
sample1.csv
sample2.csv
sample3.csv
sample4.csv
sample5.csv
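
`os.listdir()` returns every entry of a folder in arbitrary order, so in practice the result is often sorted and filtered. A minimal sketch, assuming we only want the `.csv` files; it runs in a temporary folder with made-up file names so that it does not depend on the `dataset` directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a few empty files, including one that is not a CSV
    for name in ["sample1.csv", "sample0.csv", "notes.txt"]:
        with open(os.path.join(tmp_dir, name), "w"):
            pass

    # Keep only CSV files and bring them into a stable order
    csv_files = sorted(
        name for name in os.listdir(tmp_dir) if name.endswith(".csv")
    )

print(csv_files)  # ['sample0.csv', 'sample1.csv']
```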


---
## The `pandas` library

### Importing and the alias convention `pd`

In [7]:
# Importing pandas
import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Gisela', 'Ursula', 'Jupp', 'Dieter'],
        'Age': [25, 30, 35, 62],
        'City': ['Schabernack', 'Faulebutter', 'Welt', 'Oberbillig']}  # Real German place names

### Creating a pandas DataFrame from a dictionary

In [8]:
df = pd.DataFrame(data)
df

     Name  Age         City
0  Gisela   25  Schabernack
1  Ursula   30  Faulebutter
2    Jupp   35         Welt
3  Dieter   62   Oberbillig


### Accessing columns

In [9]:
# Accessing a column
print("\nNames in the DataFrame:\n")
print(df['Name'])


Names in the DataFrame:

0    Gisela
1    Ursula
2      Jupp
3    Dieter
Name: Name, dtype: object


### Accessing rows

In [10]:
# Default Index 0, 1, 2....
print("\nDefault Index:\n")
print(df)


Default Index:

     Name  Age         City
0  Gisela   25  Schabernack
1  Ursula   30  Faulebutter
2    Jupp   35         Welt
3  Dieter   62   Oberbillig


In [11]:
# Selecting a row by index position

print(df.iloc[1])

Name         Ursula
Age              30
City    Faulebutter
Name: 1, dtype: object


In [13]:
# Set your own Index
my_index = ["A", "B", "C", "D"]
df_with_index = pd.DataFrame(data, index=my_index)

print("\nMy Index:\n")
print(df_with_index)


My Index:

     Name  Age         City
A  Gisela   25  Schabernack
B  Ursula   30  Faulebutter
C    Jupp   35         Welt
D  Dieter   62   Oberbillig


In [14]:
# Selecting a row by index label
print("\nRow with index label C:\n")
print(df_with_index.loc["C"])


Row with index label C:

Name    Jupp
Age       35
City    Welt
Name: C, dtype: object
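
The difference between `iloc` (position) and `loc` (label) also shows up when slicing: a position slice excludes its end, a label slice includes it. A small sketch with a reduced, illustrative DataFrame:

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Name": ["Gisela", "Ursula", "Jupp", "Dieter"], "Age": [25, 30, 35, 62]},
    index=["A", "B", "C", "D"],
)

by_position = df_demo.iloc[0:2]  # positions 0 and 1, end is exclusive
by_label = df_demo.loc["A":"C"]  # labels A, B and C, end is inclusive

print(list(by_position.index))   # ['A', 'B']
print(list(by_label.index))      # ['A', 'B', 'C']
```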


### Setting a value

In [15]:
# Modifying a value
df.loc[1, 'Age'] = 31
print("\nModified DataFrame:\n")
print(df)


Modified DataFrame:

     Name  Age         City
0  Gisela   25  Schabernack
1  Ursula   31  Faulebutter
2    Jupp   35         Welt
3  Dieter   62   Oberbillig


### Inspecting a DataFrame (basics)

In [16]:
# First two rows of the DataFrame
print("\nFirst two rows:\n")
print(df.head(2))

# Last two rows of the DataFrame
print("\nLast two rows:\n")
print(df.tail(2))


First two rows:

     Name  Age         City
0  Gisela   25  Schabernack
1  Ursula   31  Faulebutter

Last two rows:

     Name  Age        City
2    Jupp   35        Welt
3  Dieter   62  Oberbillig


### What is the mean of the `Age` column?

In [17]:
# Summary statistics for the DataFrame
print("\nSummary statistics:\n")
print(df.describe())


Summary statistics:

             Age
count   4.000000
mean   38.250000
std    16.357975
min    25.000000
25%    29.500000
50%    33.000000
75%    41.750000
max    62.000000
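
`describe()` reports the mean alongside other statistics. If only the mean is needed, it can also be computed directly on the column. A minimal sketch that rebuilds the `Age` values from the DataFrame above as a standalone Series:

```python
import pandas as pd

# Age values as they stand after Ursula's age was changed to 31
ages = pd.Series([25, 31, 35, 62], name="Age")

mean_age = ages.mean()
print(mean_age)  # 38.25

# The same number appears in the "mean" row of describe()
print(ages.describe()["mean"])
```

With the full DataFrame, the equivalent call would be `df['Age'].mean()`.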


In [18]:
# Basic information on the DataFrame
# df.info() prints its report itself and returns None, so no print() is needed
print("\nBasic information:\n")
df.info()


Basic information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
 2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes


### Adding more data

#### Important note: when creating a DataFrame from a dictionary whose keys each map to a single value, those values must still be wrapped in a list.
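
What happens without the lists can be checked with a short sketch: pandas raises a `ValueError` when every value in the dictionary is a scalar and no index is given. The variable names are illustrative.

```python
import pandas as pd

# Scalars only: pandas cannot tell how many rows the DataFrame should have
try:
    pd.DataFrame({"Name": "Lotta", "Age": 94, "City": "Hamburg"})
    raised = False
except ValueError as error:
    raised = True
    print(f"ValueError: {error}")

# Wrapping each value in a list gives a DataFrame with exactly one row
df_single = pd.DataFrame({"Name": ["Lotta"], "Age": [94], "City": ["Hamburg"]})
print(df_single.shape)  # (1, 3)
```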

In [19]:
# New person's data
new_data = {
    'Name': ['Lotta'],
    'Age': [94],
    'City': ['Hamburg']
}

df_new = pd.DataFrame(new_data)

# Appending the new person to the existing DataFrame
updated_df = pd.concat([df, df_new])

# Updated DataFrame with the new person added
print("\nUpdated DataFrame with new person:\n")
print(updated_df)


Updated DataFrame with new person:

     Name  Age         City
0  Gisela   25  Schabernack
1  Ursula   31  Faulebutter
2    Jupp   35         Welt
3  Dieter   62   Oberbillig
0   Lotta   94      Hamburg


In [22]:
# The index label 0 now occurs twice, so .loc[0] returns both rows
updated_df.loc[0]

     Name  Age         City
0  Gisela   25  Schabernack
0   Lotta   94      Hamburg

### Ignoring the index

In [20]:
# Creating a DataFrame from a dictionary
data = {'Name': ['Gisela', 'Ursula', 'Jupp', 'Dieter'],
        'Age': [25, 30, 35, 62],
        'City': ['Schabernack', 'Faulebutter', 'Welt', 'Oberbillig']}

df = pd.DataFrame(data)

# New person's data
new_data = {
    'Name': ['Lotta'],
    'Age': [94],
    'City': ['Hamburg']
}

df_new = pd.DataFrame(new_data)

# Appending the new person to the existing DataFrame
updated_df = pd.concat([df, df_new], ignore_index=True)

# Updated DataFrame with the new person added
print("\nUpdated DataFrame with new person:")
print(updated_df)


Updated DataFrame with new person:
     Name  Age         City
0  Gisela   25  Schabernack
1  Ursula   30  Faulebutter
2    Jupp   35         Welt
3  Dieter   62   Oberbillig
4   Lotta   94      Hamburg


### Adding another column

In [23]:
new_column = {
    "Height": [160, 170, 190, 150]
}

df_new_column = pd.DataFrame(new_column)

# Concat along the column axis by choosing axis=1
updated_df = pd.concat([updated_df, df_new_column], axis=1)

print(updated_df)

     Name  Age         City  Height
0  Gisela   25  Schabernack   160.0
1  Ursula   30  Faulebutter   170.0
2    Jupp   35         Welt   190.0
3  Dieter   62   Oberbillig   150.0
4   Lotta   94      Hamburg     NaN


In [28]:
# An alternative without pd.concat would be
updated_df["Height"] = [160, 170, 190, 150, 12]
updated_df

     Name  Age         City  Height
0  Gisela   25  Schabernack     160
1  Ursula   30  Faulebutter     170
2    Jupp   35         Welt     190
3  Dieter   62   Oberbillig     150
4   Lotta   94      Hamburg      12


### Writing and reading CSV files with pandas

In [29]:
# Creating simple sample data
sample_data = {
    'Name': ['John', 'Alice', 'Bob', 'Cathy', 'David'],
    'Age': [32, 24, 28, 35, 30],
    'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago', 'Austin']
}

sample_df = pd.DataFrame(sample_data)

# Save the sample DataFrame as a CSV file in the "dataset" folder
sample_df.to_csv('dataset/sample_data.csv', index=False)

### Mini exercise: read the file we just wrote back in!

In [30]:
# Read the CSV file we just saved into a new DataFrame
loaded_df = pd.read_csv('dataset/sample_data.csv')
print("\nLoaded DataFrame from CSV:\n")
print(loaded_df)

# Exercise: Read a CSV file from the "dataset" folder and save a modified version
# In a new code cell, read a CSV file from the "dataset" folder, modify the DataFrame,
# and save the modified DataFrame as a new CSV file.


Loaded DataFrame from CSV:

    Name  Age           City
0   John   32       New York
1  Alice   24  San Francisco
2    Bob   28    Los Angeles
3  Cathy   35        Chicago
4  David   30         Austin
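
The `index=False` argument used when writing the file matters for the round trip. The sketch below compares both variants; it writes to in-memory strings via `io.StringIO` instead of real files (an assumption made to keep it self-contained), and the small DataFrame is illustrative.

```python
import io
import pandas as pd

frame = pd.DataFrame({"Name": ["John", "Alice"], "Age": [32, 24]})

# to_csv() without a path returns the CSV text as a string
with_index = pd.read_csv(io.StringIO(frame.to_csv()))
without_index = pd.read_csv(io.StringIO(frame.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'Name', 'Age']
print(list(without_index.columns))  # ['Name', 'Age']
```

Without `index=False`, the row index is written as an extra unnamed column and comes back as `Unnamed: 0` when the file is read again.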


In [34]:
exercise_df = pd.read_csv('dataset/sample_data.csv')
print("Original csv")
print(exercise_df)
print("Modified Dataframe")
exercise_df["Exercises"] = ["Read CSV file", "Modify dataframe", "save dataframe to csv", "pause", "go home"]
print(exercise_df)
exercise_df.to_csv('dataset/exercise_data.csv', index=False)

Original csv
    Name  Age           City
0   John   32       New York
1  Alice   24  San Francisco
2    Bob   28    Los Angeles
3  Cathy   35        Chicago
4  David   30         Austin
Modified Dataframe
    Name  Age           City              Exercises
0   John   32       New York          Read CSV file
1  Alice   24  San Francisco       Modify dataframe
2    Bob   28    Los Angeles  save dataframe to csv
3  Cathy   35        Chicago                  pause
4  David   30         Austin                go home


### A quick look at further pandas functions

In [35]:
# Sorting a DataFrame

sorted_df = sample_df.sort_values(by="Age")
print("\nSorted DataFrame by age:\n")
print(sorted_df)


Sorted DataFrame by age:

    Name  Age           City
1  Alice   24  San Francisco
2    Bob   28    Los Angeles
4  David   30         Austin
0   John   32       New York
3  Cathy   35        Chicago


In [36]:
# Filtering a DataFrame

mask = sample_df['Age'] > 30
filtered_df = sample_df[mask]
print("\nFiltered DataFrame with ages greater than 30:\n")
print(filtered_df)


Filtered DataFrame with ages greater than 30:

    Name  Age      City
0   John   32  New York
3  Cathy   35   Chicago
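
Several conditions can be combined into one mask with `&` (and) and `|` (or). Each condition needs its own parentheses. A minimal sketch that rebuilds the sample data and uses an illustrative age range:

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["John", "Alice", "Bob", "Cathy", "David"],
    "Age": [32, 24, 28, 35, 30],
})

# Keep everyone aged 28 to 32, both bounds included
age_mask = (people["Age"] >= 28) & (people["Age"] <= 32)
selected = people[age_mask]

print(list(selected["Name"]))  # ['John', 'Bob', 'David']
```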


In [37]:
# Renaming columns in a DataFrame

renamed_df = sample_df.rename(columns={'Name': 'Full Name', 'Age': 'Age in Years', 'City': 'Hometown'})
print("\nRenamed columns in DataFrame:\n")
print(renamed_df)


Renamed columns in DataFrame:

  Full Name  Age in Years       Hometown
0      John            32       New York
1     Alice            24  San Francisco
2       Bob            28    Los Angeles
3     Cathy            35        Chicago
4     David            30         Austin


In [38]:
# Adding a new column to a DataFrame

sample_df['Country'] = 'USA'
print("\nDataFrame with a new column 'Country':\n")
print(sample_df)


DataFrame with a new column 'Country':

    Name  Age           City Country
0   John   32       New York     USA
1  Alice   24  San Francisco     USA
2    Bob   28    Los Angeles     USA
3  Cathy   35        Chicago     USA
4  David   30         Austin     USA


In [39]:
# Dropping a column from a DataFrame

dropped_df = sample_df.drop('Country', axis=1)
print("\nDataFrame with the 'Country' column dropped:\n")
print(dropped_df)


DataFrame with the 'Country' column dropped:

    Name  Age           City
0   John   32       New York
1  Alice   24  San Francisco
2    Bob   28    Los Angeles
3  Cathy   35        Chicago
4  David   30         Austin
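
Note that `drop()`, like `rename()` and `sort_values()`, returns a new DataFrame by default and leaves the original untouched. This can be checked with a short sketch; the DataFrame is a reduced, illustrative copy of the sample data.

```python
import pandas as pd

original = pd.DataFrame({"Name": ["John", "Alice"], "Country": ["USA", "USA"]})

# drop() returns a new DataFrame, the original keeps its column
dropped = original.drop("Country", axis=1)

print(list(original.columns))  # ['Name', 'Country']
print(list(dropped.columns))   # ['Name']
```

To make the change permanent, assign the result back to the same name, for example `original = original.drop("Country", axis=1)`.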
