# Cheat Sheet: Working with Data in Python

## Reading and writing files

### 1. File opening modes

<p>The mode argument of <code>open()</code> determines which operations are allowed on the file.</p>

**Syntax**:
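- `r`: reading (the default)
- `w`: writing (creates the file, or truncates an existing one)
- `a`: appending
- `+`: updating (read and write), combined with another mode as in `r+`
- `b`: binary mode (text mode otherwise)
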
```Python
with open("test1.txt", "r") as file:
    content = file.read()
    print(content)

with open("test1.txt", "w") as file:
    file.write("Hello!")

with open("test1.txt", "a") as file:
    file.write("Something")

with open("test1.txt", "r+") as file:
    content = file.read()
    file.write("Something")
```
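
As a minimal sketch of how these modes differ in effect, the example below writes to a scratch file and inspects the result. The temporary directory and the name `demo.txt` are assumptions made for illustration, not files from the examples above.

```Python
import os
import tempfile

# Hypothetical scratch file so the demo does not touch real data.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

with open(path, "w") as file:      # "w" creates or truncates the file
    file.write("Hello!")

with open(path, "a") as file:      # "a" writes at the end, keeping existing content
    file.write(" Something")

with open(path, "r+") as file:     # "r+" reads and writes without truncating
    before = file.read()           # reading moves the position to the end,
    file.write(" More")            # so this write lands after the existing text

with open(path, "r") as file:
    after = file.read()

print(before)
print(after)
```

Opening the same file again with `"w"` would discard everything written so far, which is the main difference from `"a"` and `"r+"`.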

### 2. File reading methods

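<p>Methods for reading file content: all lines as a list, one line at a time, or the whole file as a single string. Each call continues from the current position in the file.</p>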

**Syntax**:

```Python
with open("test2.txt", "r") as file:
    file.readlines()    # reads all lines as a list
    file.readline()     # reads the next line as a string
    file.read()         # reads the entire file content as a string
```

In [1]:
import os

dir_name = os.path.join(".", "data")
file1 = os.path.join(dir_name, "example2.txt")

with open(file1, "r") as file:
    lines = file.readlines()
    next_line = file.readline()
    content = file.read()

print(f"lines: {lines}")
print(f"next line: {next_line}")
print(f"content: {content}")

lines: ['Line 1. \n', 'Line 2. \n', 'Line 3. \n', 'Finished. \n']
next line: 
content: 
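
The empty `next line` and `content` values above are expected: `readlines()` already consumed the whole file, so nothing is left for the later calls. A small sketch of rewinding with `seek(0)` follows; the scratch file name `example.txt` is an assumption for illustration.

```Python
import os
import tempfile

# Hypothetical scratch file with two lines.
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as file:
    file.write("Line 1.\nLine 2.\n")

with open(path, "r") as file:
    lines = file.readlines()      # consumes the whole file
    leftover = file.read()        # nothing left to read, so this is ""
    file.seek(0)                  # move back to the start of the file
    first_line = file.readline()  # reads only the first line
    rest = file.read()            # reads everything after the first line

print(lines)
print(repr(leftover))
print(repr(first_line))
print(repr(rest))
```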


### 3. File writing methods

<p>Methods for writing content to a file. Neither method adds newline characters automatically.</p>

**Syntax**:

```Python
with open("test3.txt", "w") as file:
    file.write(content)  # writes a string to the file
    file.writelines(lines)  # writes a list of strings to the file
```

In [2]:
file2 = os.path.join(dir_name, "example4.txt")

lines = ["Hello\n", "World!\n"]
with open(file2, "w") as file:
    file.writelines(lines)
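
Because `writelines()` writes the strings exactly as given, the list items need their own `\n` if they should end up on separate lines. The sketch below illustrates this with a scratch file; the path and the `words` list are assumptions for the example.

```Python
import os
import tempfile

# Hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), "words.txt")
words = ["Hello", "World!"]

with open(path, "w") as file:
    file.writelines(words)                        # no separators are added
with open(path, "r") as file:
    joined = file.read()

with open(path, "w") as file:
    file.writelines(f"{word}\n" for word in words)  # add the newline ourselves
with open(path, "r") as file:
    separated = file.read()

print(repr(joined))
print(repr(separated))
```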

### 4. Iterating over lines

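<p>Iterates through each line in the file using a <code>for</code> loop. Each line keeps its trailing newline, so <code>print(line)</code> shows a blank line after each one.</p>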

**Syntax**:

```Python
with open(filename, mode) as file:
    for line in file:
        print(line)
```

In [3]:
with open(file1, "r") as file:
    for line in file:
        print(line)

Line 1. 

Line 2. 

Line 3. 

Finished. 
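
The blank lines in the output above come from the newline stored in each line plus the one `print()` adds. One way to avoid them, sketched here with a hypothetical scratch file, is to strip the trailing newline before printing.

```Python
import os
import tempfile

# Hypothetical scratch file with two lines.
path = os.path.join(tempfile.mkdtemp(), "lines.txt")
with open(path, "w") as file:
    file.write("Line 1.\nLine 2.\n")

stripped = []
with open(path, "r") as file:
    for number, line in enumerate(file, start=1):
        text = line.rstrip("\n")        # drop the newline kept by iteration
        stripped.append(text)
        print(f"{number}: {text}")
```

Calling `print(line, end="")` is an alternative that leaves the line untouched and only suppresses the extra newline from `print()`.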



### 5. `open()` and `close()`

<p>Opens a file, performs operations, and explicitly closes the file using the <code>close()</code> method.</p>

**Syntax**:

```Python
file = open(filename, mode)
file.close()
```

In [4]:
file = open(file1, "r")
content = file.read()
file.close()
content

'Line 1. \nLine 2. \nLine 3. \nFinished. \n'

### 6. `with open()`

<p>Opens a file using a with block, ensuring automatic file closure after usage.</p>

**Syntax**:

```Python
with open(filename, mode) as file:
    pass
```

In [5]:
with open(file1, "r") as file:
    content = file.read()

content

'Line 1. \nLine 2. \nLine 3. \nFinished. \n'
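
A short sketch of what "automatic closure" means in practice: the file is closed when the block ends, even if an exception is raised inside it. The scratch file and the simulated `ValueError` are assumptions for illustration.

```Python
import os
import tempfile

# Hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as file:
    file.write("data")

try:
    with open(path, "r") as file:
        content = file.read()
        raise ValueError("simulated failure")   # error inside the with block
except ValueError:
    pass

closed_after_error = file.closed   # the with block closed the file anyway
print(content, closed_after_error)
```

With a plain `open()` and `close()` pair, the same guarantee requires wrapping the work in `try` and `finally`.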

## Pandas

### 1. Import `pandas`

<p>Imports the <code>pandas</code> library with the alias pd.</p>

In [6]:
import pandas as pd

### 2. `read_csv()`

<p>Reads data from a <code>csv</code> file and creates a DataFrame.</p>

**Syntax**:

```Python
df = pd.read_csv(filename_csv)
```

In [7]:
file3 = os.path.join(dir_name, "TopSellingAlbums.csv")
df = pd.read_csv(file3)
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,17-Feb-76,,7.5
6,Bee Gees,Saturday Night Fever,1977,1:15:54,disco,20.6,40,15-Nov-77,Y,7.0
7,Fleetwood Mac,Rumours,1977,0:40:01,soft rock,27.9,40,04-Feb-77,,6.5


### 3. `read_excel()`

<p>Reads data from an Excel file and creates a DataFrame.</p>

**Syntax**:

```Python
df = pd.read_excel(filename_excel)
```

In [8]:
file4 = os.path.join(dir_name, "TopSellingAlbums.xlsx")
df = pd.read_excel(file4)
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,1973-03-01,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,,7.5
6,Bee Gees,Saturday Night Fever,1977,01:15:54,disco,20.6,40,1977-11-15,Y,7.0
7,Fleetwood Mac,Rumours,1977,0:40:01,soft rock,27.9,40,1977-02-04,,6.5


### 4. `to_csv()`

<p>Writes DataFrame to a <code>csv</code> file.</p>

**Syntax**:

```Python
df.to_csv(filename, index=False)
```
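
The effect of `index=False` can be sketched without touching the disk, since `to_csv()` returns the CSV text when no path is given. The small DataFrame below is made up for the example.

```Python
import io

import pandas as pd

# Small illustrative DataFrame (not the album file used above).
df = pd.DataFrame({"Artist": ["AC/DC", "Eagles"], "Released": [1980, 1976]})

with_index = df.to_csv()                 # index written as an unnamed first column
without_index = df.to_csv(index=False)   # only the data columns

header_with = with_index.splitlines()[0]
header_without = without_index.splitlines()[0]

# Reading the text back shows how the index column resurfaces.
reread_with = pd.read_csv(io.StringIO(with_index))
reread_without = pd.read_csv(io.StringIO(without_index))

print(header_with)
print(header_without)
print(list(reread_with.columns))
```

Reading back a file that was saved with its index produces an extra column named `Unnamed: 0`, which is usually a sign that `index=False` was omitted when the file was written.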

### 5. Access Columns

<p>Accesses one or more columns using <code>[[]]</code>. The result is a DataFrame, even when only one column is selected.</p>

**Syntax**:

```Python
df[["col"]]             # Accesses a single column
df[["col1", "col2"]]    # Accesses multiple columns
```

In [9]:
df[["Artist"]]

Unnamed: 0,Artist
0,Michael Jackson
1,AC/DC
2,Pink Floyd
3,Whitney Houston
4,Meat Loaf
5,Eagles
6,Bee Gees
7,Fleetwood Mac


In [10]:
df[["Artist", "Album"]]

Unnamed: 0,Artist,Album
0,Michael Jackson,Thriller
1,AC/DC,Back in Black
2,Pink Floyd,The Dark Side of the Moon
3,Whitney Houston,The Bodyguard
4,Meat Loaf,Bat Out of Hell
5,Eagles,Their Greatest Hits (1971-1975)
6,Bee Gees,Saturday Night Fever
7,Fleetwood Mac,Rumours
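
Single brackets are also valid for one column but return a different type. The sketch below contrasts the two forms on a small made-up DataFrame.

```Python
import pandas as pd

# Small illustrative DataFrame.
df = pd.DataFrame({
    "Artist": ["AC/DC", "Pink Floyd"],
    "Album": ["Back in Black", "The Dark Side of the Moon"],
})

as_series = df["Artist"]     # single brackets: a one-dimensional Series
as_frame = df[["Artist"]]    # double brackets: a DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```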


### 6. `describe()`

<p>Generates summary statistics for the numeric columns in the DataFrame.</p>

**Syntax**:

```Python
df.describe()
```

In [11]:
df.describe()

Unnamed: 0,Released,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
count,8.0,8.0,8.0,8,8.0
mean,1979.25,28.125,46.125,1979-10-20 21:00:00,8.25
min,1973.0,20.6,40.0,1973-03-01 00:00:00,6.5
25%,1976.75,23.3,41.5,1976-11-07 18:00:00,7.375
50%,1977.0,26.75,43.5,1977-11-02 12:00:00,8.25
75%,1980.5,28.975,46.25,1981-02-24 12:00:00,9.125
max,1992.0,46.0,65.0,1992-11-17 00:00:00,10.0
std,5.800246,8.189322,8.271077,,1.224745


### 7. `drop()`

<p>Removes specified rows or columns from the DataFrame. <code>axis=1</code> indicates columns. <code>axis=0</code> indicates rows.</p>

**Syntax**:

```Python
df.drop(["column1", "column2"], axis=1, inplace=True)
df.drop(index=["row1", "row2"], axis=0, inplace=True)
```

In [12]:
df.drop(["Soundtrack"], axis=1, inplace=True)
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,1973-03-01,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,7.5
6,Bee Gees,Saturday Night Fever,1977,01:15:54,disco,20.6,40,1977-11-15,7.0
7,Fleetwood Mac,Rumours,1977,0:40:01,soft rock,27.9,40,1977-02-04,6.5


In [13]:
df.drop(index=[6, 7], axis=0, inplace=True)
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,1973-03-01,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,7.5
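
Without `inplace=True`, `drop()` leaves the original DataFrame untouched and returns a new one, which can be easier to reason about. A minimal sketch on a made-up DataFrame, using the `columns=` and `index=` keywords instead of `axis`:

```Python
import pandas as pd

# Small illustrative DataFrame.
df = pd.DataFrame({
    "Artist": ["Eagles", "Bee Gees", "Fleetwood Mac"],
    "Soundtrack": [None, "Y", None],
    "Rating": [7.5, 7.0, 6.5],
})

no_soundtrack = df.drop(columns=["Soundtrack"])   # same as axis=1
first_only = df.drop(index=[1, 2])                # same as axis=0

print(list(no_soundtrack.columns))
print(list(first_only.index))
print(list(df.columns))   # the original still has all three columns
```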


### 8. `dropna()`

<p>Removes rows that contain missing (<b>NaN</b>) values from the DataFrame. <code>axis=0</code> indicates rows.</p>

**Syntax**:

```Python
df.dropna(axis=0, inplace=True)
```

In [14]:
df.dropna(axis=0, inplace=True)
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,1973-03-01,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,7.5
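
In the run above no rows are removed, because the mostly empty `Soundtrack` column had already been dropped. The sketch below uses a made-up DataFrame that still contains missing values, and shows the `subset` parameter for checking only selected columns.

```Python
import numpy as np
import pandas as pd

# Small illustrative DataFrame with missing values in two columns.
df = pd.DataFrame({
    "Album": ["Thriller", "The Bodyguard", "Rumours"],
    "Soundtrack": [np.nan, "Y", np.nan],
    "Rating": [10.0, 8.5, np.nan],
})

any_missing = df.dropna()                      # drop rows with NaN in any column
rating_present = df.dropna(subset=["Rating"])  # only require Rating to be present

print(list(any_missing.index))
print(list(rating_present.index))
```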


### 9. `duplicated()`

<p>Returns a boolean Series that marks rows duplicating an earlier row. Indexing the DataFrame with it selects the duplicate records.</p>

**Syntax**:

```Python
df.duplicated()
```

In [15]:
dr = df[df.duplicated()]
dr

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
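
The album data has no duplicate rows, so the result above is empty. A small sketch with a deliberately repeated row (made-up data) shows what the mask looks like and how `drop_duplicates()` removes the repeats.

```Python
import pandas as pd

# Small illustrative DataFrame where row 2 repeats row 0.
df = pd.DataFrame({
    "Artist": ["AC/DC", "Eagles", "AC/DC"],
    "Album": ["Back in Black", "Their Greatest Hits (1971-1975)", "Back in Black"],
})

mask = df.duplicated()               # True for rows that repeat an earlier row
duplicates = df[mask]                # only the repeated rows
deduplicated = df.drop_duplicates()  # first occurrences only

print(mask.tolist())
print(list(deduplicated.index))
```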


### 10. Filter Rows

<p>Creates a new DataFrame with rows that meet specified conditions.</p>

**Syntax**:

```Python
filtered_df = df[(conditional_statements)]
```

In [16]:
filtered_df = df[df["Released"] > 1979]
filtered_df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,9.5
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,8.5
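
Several conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in parentheses. The sketch below uses a four-row subset of the album values shown above.

```Python
import pandas as pd

# Four albums with values taken from the table above.
df = pd.DataFrame({
    "Album": ["Thriller", "Back in Black", "The Dark Side of the Moon", "The Bodyguard"],
    "Released": [1982, 1980, 1973, 1992],
    "Rating": [10.0, 9.5, 9.0, 8.5],
})

# Both conditions must hold.
recent_and_high = df[(df["Released"] > 1979) & (df["Rating"] >= 9.5)]

# At least one condition must hold.
old_or_low = df[(df["Released"] < 1975) | (df["Rating"] < 9.0)]

print(recent_and_high["Album"].tolist())
print(old_or_low["Album"].tolist())
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.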


### 11. `groupby()`

<p>Splits a DataFrame into groups based on specified criteria, enabling subsequent aggregation, transformation, or analysis within each group.</p>

**Syntax**:

```Python
grouped = df.groupby(by, level=None, as_index=True, sort=True,
                     group_keys=True, observed=False, dropna=True)
```

In [17]:
grouped = df.groupby(["Released"])
grouped

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C73D1EE660>
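
The `DataFrameGroupBy` object shown above holds no results by itself; values appear once an aggregation is applied. A minimal sketch, using three albums from the table above:

```Python
import pandas as pd

# Three albums with values taken from the table above.
df = pd.DataFrame({
    "Album": ["Bat Out of Hell", "Saturday Night Fever", "Back in Black"],
    "Released": [1977, 1977, 1980],
    "Rating": [8.0, 7.0, 9.5],
})

mean_rating = df.groupby("Released")["Rating"].mean()  # one value per year
counts = df.groupby("Released").size()                 # rows per year

print(mean_rating)
print(counts)
```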

### 12. `head()`

<p>Displays the first n rows of the DataFrame.</p>

**Syntax**:

```Python
df.head(n)
```

In [18]:
df.head(2)

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,9.5


### 13. `info()`

<p>Provides information about the DataFrame, including data types and memory usage.</p>

**Syntax**:

```Python
df.info()
```

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 9 columns):
 #   Column                            Non-Null Count  Dtype         
---  ------                            --------------  -----         
 0   Artist                            6 non-null      object        
 1   Album                             6 non-null      object        
 2   Released                          6 non-null      int64         
 3   Length                            6 non-null      object        
 4   Genre                             6 non-null      object        
 5   Music Recording Sales (millions)  6 non-null      float64       
 6   Claimed Sales (millions)          6 non-null      int64         
 7   Released.1                        6 non-null      datetime64[ns]
 8   Rating                            6 non-null      float64       
dtypes: datetime64[ns](1), float64(2), int64(2), object(4)
memory usage: 564.0+ bytes


### 14. `merge()`

<p>Merges two DataFrames on one or more common columns. The default is an inner join, which keeps only rows whose key values appear in both DataFrames.</p>

**Syntax**:

```Python
merged_df = pd.merge(df1, df2, on=["column1", "column2"])
```
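
A small sketch of a two-column merge follows. The DataFrames `albums` and `ratings` are made up for the example; `how="left"` is shown alongside the default inner join for contrast.

```Python
import pandas as pd

# Two illustrative DataFrames sharing the key columns Artist and Released.
albums = pd.DataFrame({
    "Artist": ["AC/DC", "Eagles"],
    "Released": [1980, 1976],
    "Album": ["Back in Black", "Their Greatest Hits (1971-1975)"],
})
ratings = pd.DataFrame({
    "Artist": ["AC/DC", "Pink Floyd"],
    "Released": [1980, 1973],
    "Rating": [9.5, 9.0],
})

# Inner join (default): only key pairs present in both DataFrames.
merged_df = pd.merge(albums, ratings, on=["Artist", "Released"])

# Left join: every row of albums, with NaN where ratings has no match.
left_df = pd.merge(albums, ratings, on=["Artist", "Released"], how="left")

print(merged_df)
print(left_df)
```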

### 15. Print DataFrame

<p>Displays the content of the DataFrame.</p>

**Syntax**:

```Python
print(df)   # in Jupyter, evaluating df as the last line of a cell also displays it
```

In [20]:
print(df)

            Artist                            Album  Released    Length  \
0  Michael Jackson                         Thriller      1982  00:42:19   
1            AC/DC                    Back in Black      1980   0:42:11   
2       Pink Floyd        The Dark Side of the Moon      1973   0:42:49   
3  Whitney Houston                    The Bodyguard      1992   0:57:44   
4        Meat Loaf                  Bat Out of Hell      1977   0:46:33   
5           Eagles  Their Greatest Hits (1971-1975)      1976   0:43:08   

                         Genre  Music Recording Sales (millions)  \
0               pop, rock, R&B                              46.0   
1                    hard rock                              26.1   
2             progressive rock                              24.2   
3               R&B, soul, pop                              27.4   
4  hard rock, progressive rock                              20.6   
5   rock, soft rock, folk rock                              32.2  

In [21]:
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,1973-03-01,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,7.5


### 16. `replace()`

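<p>Replaces specific values in a column with new values. <code>replace()</code> returns a new Series, so the result must be assigned back for the DataFrame to change.</p>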

**Syntax**:

```Python
df["column"] = df["column"].replace(old_value, new_value)
```

In [22]:
# No remaining row has Rating 7.0 (rows 6 and 7 were dropped), so df is unchanged.
df["Rating"] = df["Rating"].replace(7.0, 9.0)
df

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,1973-03-01,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,7.5
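
To make the behavior visible, the sketch below uses a made-up column that does contain the value being replaced, and compares the DataFrame before and after assigning the result back.

```Python
import pandas as pd

# Small illustrative DataFrame that contains the value 7.0.
df = pd.DataFrame({"Rating": [10.0, 7.0, 8.5]})

result = df["Rating"].replace(7.0, 9.0)   # new Series with the replacement
unchanged = df["Rating"].tolist()         # df itself has not changed yet

df["Rating"] = df["Rating"].replace(7.0, 9.0)   # assign back to keep the change
updated = df["Rating"].tolist()

print(result.tolist())
print(unchanged)
print(updated)
```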


### 17. `tail()`

<p>Displays the last n rows of the DataFrame.</p>

**Syntax**:

```Python
df.tail(n)
```

In [23]:
df.tail(3)

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Rating
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,8.0
5,Eagles,Their Greatest Hits (1971-1975),1976,0:43:08,"rock, soft rock, folk rock",32.2,42,1976-02-17,7.5


## NumPy

### 1. Import `NumPy`

<p>Imports the <code>NumPy</code> library with the alias np.</p>

In [24]:
import numpy as np

### 2. `np.array()`

<p>Creates a one-dimensional or multi-dimensional array.</p>

**Syntax**:

```Python
array_1d = np.array([list_elements])
array_2d = np.array([[list1_elements], [list2_elements]])
```

In [25]:
array_1d = np.array([1, 2, 3])
array_1d

array([1, 2, 3])

In [26]:
array_2d = np.array([[1, 2], [3, 4], [5, 6]])
array_2d

array([[1, 2],
       [3, 4],
       [5, 6]])

### 3. Array Functions

- `np.mean()`: calculates the mean of the array elements
- `np.sum()`: calculates the sum of the array elements
- `np.min()`: finds the minimum value in the array
- `np.max()`: finds the maximum value in the array
- `np.dot()`: computes the dot product of two arrays

In [27]:
np.mean(array_1d)

np.float64(2.0)

In [28]:
np.sum(array_1d)

np.int64(6)

In [29]:
np.min(array_1d)

np.int64(1)

In [30]:
np.max(array_1d)

np.int64(3)

In [31]:
np.dot(array_1d, array_2d)

array([22, 28])
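
The result `[22, 28]` follows from the shapes: a `(3,)` vector times a `(3, 2)` matrix gives a `(2,)` vector, one sum of products per column. The sketch below recomputes it by hand and shows that the reversed order does not work because the shapes no longer line up.

```Python
import numpy as np

array_1d = np.array([1, 2, 3])
array_2d = np.array([[1, 2], [3, 4], [5, 6]])

result = np.dot(array_1d, array_2d)   # shapes (3,) and (3, 2) give (2,)

# Same computation spelled out: multiply the vector with each column and sum.
manual = [int(sum(array_1d * array_2d[:, j])) for j in range(array_2d.shape[1])]

# Swapping the arguments pairs a length-2 axis with a length-3 vector.
try:
    np.dot(array_2d, array_1d)
    reversed_ok = True
except ValueError:
    reversed_ok = False

print(result, manual, reversed_ok)
```

The first column gives 1·1 + 2·3 + 3·5 = 22 and the second gives 1·2 + 2·4 + 3·6 = 28.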
