## [INTRO]

**Script:** "Hi everyone, welcome back to our Python series! If you've been following along, you should already be familiar with setting up your Python environment, organizing your project files, and managing virtual environments using Anaconda. If you haven't watched those videos yet, I recommend checking them out first, as they'll provide a solid foundation for today's tutorial.

In this video, we're going to dive into Python Pandas, an incredibly powerful library for data manipulation and analysis. Whether you're just starting out or looking to sharpen your skills, Pandas is a must-know tool for anyone working with data in Python. We'll cover everything from loading and exploring data to performing basic data manipulations and saving your work. By the end of this tutorial, you'll have a strong grasp of how to use Pandas for your data analysis projects."

**Visual:** Show a brief intro slide with mentions of previous videos and the Pandas logo.

### [SECTION 1: Installing Pandas]
**Script:** "Let's start by ensuring you have Pandas installed in your Python environment. If you're using Anaconda, Pandas usually comes pre-installed, so you might not need to install it manually. However, if you're working in a different environment or just want to double-check, you can easily install Pandas using the pip package manager.

To do this, open your terminal or command prompt and type the following command: `pip install pandas`. This will download and install Pandas along with any dependencies it needs. Once it's installed, you're ready to start using it in your projects."

**Visual:** Show the command running in a terminal.
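
If you would rather confirm the installation from inside Python, a small check along these lines can help. This is only a sketch: the helper name `pandas_version` is made up for this example, and it relies on the standard-library `importlib.metadata` module to look up the installed package.

```python
from importlib import metadata


def pandas_version():
    """Return the installed Pandas version string, or None if it is not installed."""
    try:
        return metadata.version("pandas")
    except metadata.PackageNotFoundError:
        return None


version = pandas_version()
if version is None:
    print("Pandas is not installed; run: pip install pandas")
else:
    print(f"Pandas {version} is installed")
```

The check does not import Pandas itself, so it runs safely in an environment where the library is missing.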

### [SECTION 2: Importing Pandas]

**Script:** "Now that Pandas is installed, the next step is to import it into your Python script or Jupyter notebook. The convention is to import Pandas using the alias `pd`, which makes it easier to reference Pandas functions throughout your code.

Here's how you do it: `import pandas as pd`. This line of code imports the Pandas library and allows you to use `pd` as a shorthand whenever you want to call a Pandas function. It's a small step, but it will save you time and effort in the long run."

**Visual:** Show the code snippet in a Python script.

In [3]:
import pandas as pd

### [SECTION 3: Loading Data]

**Script:** "With Pandas imported, we can start working with data. One of the most common tasks in data analysis is loading data from a file. Pandas supports reading data from a variety of formats, including CSV, Excel, SQL databases, and more.

In this example, we'll load a CSV file into a Pandas DataFrame. A DataFrame is like a table in a database or an Excel spreadsheet, where data is stored in rows and columns. To load a CSV file, use the `pd.read_csv()` function, passing the file path as an argument. Let's say you have a file named 'your_dataset.csv' in your working directory. You would load it like this: `df = pd.read_csv('your_dataset.csv')`.

Once the data is loaded, it's a good idea to take a quick look at the first few rows to make sure everything loaded correctly. You can do this by calling the `head()` method on your DataFrame. For example, `df.head()` will display the first five rows of the DataFrame, giving you a snapshot of your data."

**Visual:** Show the `df.head()` output in the terminal.

In [4]:
# Load data from a CSV file
df = pd.read_csv('../Dataset/heart copy.csv')

# Display the first few rows of the dataframe
print(df.head())

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   52    1   0       125   212    0        1      168      0      1.0      2   
1   53    1   0       140   203    1        0      155      1      3.1      0   
2   70    1   0       145   174    0        1      125      1      2.6      0   
3   61    1   0       148   203    0        1      161      0      0.0      2   
4   62    0   0       138   294    1        1      106      0      1.9      1   

   ca  thal  target  
0   2     3       0  
1   0     3       0  
2   0     3       0  
3   1     3       0  
4   3     2       0  
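
If you don't have the dataset on disk yet, you can still try `read_csv()` on a few lines of text. The sketch below assumes a tiny, made-up CSV held in a string (the `csv_text` and `sample` names are just for illustration); `read_csv()` accepts a file-like object as well as a path, so `io.StringIO` stands in for the file.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """age,sex,chol
52,1,212
53,1,203
70,1,174
"""

# read_csv accepts a path or a file-like object, so StringIO works here
sample = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; with no argument it returns five
print(sample.head(2))
print(sample.columns.tolist())
```

The first line of the text becomes the column names, and each following line becomes one row.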


### [SECTION 4: Exploring the Data]

**Script:** "After loading your data, the next step is to explore it. Pandas provides several useful functions to help you understand your data better.

First, you might want to check the shape of your DataFrame to see how many rows and columns it has. You can do this with the `shape` attribute: `df.shape`. This will return a tuple, where the first number is the number of rows, and the second is the number of columns.

Next, you can get a summary of your DataFrame using the `info()` method. This will tell you the data type of each column, how many non-null values it contains, and how much memory the DataFrame is using. Comparing the non-null counts against the total number of rows is a quick way to spot missing data.

Finally, to get a quick overview of the statistical properties of your numerical columns, you can use the `describe()` method. This will give you statistics like the mean, standard deviation, minimum, and maximum values for each numeric column. These basic statistics can provide valuable insights into the distribution of your data."

**Visual:** Show each output, explaining what each function does.

In [5]:
# Display the shape of the dataframe
print(df.shape)

(1025, 14)


In [6]:
# Get a summary of the dataframe (info() prints the summary itself and returns None)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   cp        1025 non-null   int64  
 3   trestbps  1025 non-null   int64  
 4   chol      1025 non-null   int64  
 5   fbs       1025 non-null   int64  
 6   restecg   1025 non-null   int64  
 7   thalach   1025 non-null   int64  
 8   exang     1025 non-null   int64  
 9   oldpeak   1025 non-null   float64
 10  slope     1025 non-null   int64  
 11  ca        1025 non-null   int64  
 12  thal      1025 non-null   int64  
 13  target    1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


In [7]:
# Display basic statistics of the numerical columns
print(df.describe())

               age          sex           cp     trestbps        chol  \
count  1025.000000  1025.000000  1025.000000  1025.000000  1025.00000   
mean     54.434146     0.695610     0.942439   131.611707   246.00000   
std       9.072290     0.460373     1.029641    17.516718    51.59251   
min      29.000000     0.000000     0.000000    94.000000   126.00000   
25%      48.000000     0.000000     0.000000   120.000000   211.00000   
50%      56.000000     1.000000     1.000000   130.000000   240.00000   
75%      61.000000     1.000000     2.000000   140.000000   275.00000   
max      77.000000     1.000000     3.000000   200.000000   564.00000   

               fbs      restecg      thalach        exang      oldpeak  \
count  1025.000000  1025.000000  1025.000000  1025.000000  1025.000000   
mean      0.149268     0.529756   149.114146     0.336585     1.071512   
std       0.356527     0.527878    23.005724     0.472772     1.175053   
min       0.000000     0.000000    71.000000  
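
To see how the three exploration tools fit together, here is a small sketch on a made-up four-row DataFrame (the `toy` frame and its values are invented for this example, loosely modeled on two columns of the dataset).

```python
import pandas as pd

# Hypothetical four-row frame, small enough to check by hand
toy = pd.DataFrame({"age": [52, 53, 70, 61], "oldpeak": [1.0, 3.1, 2.6, 0.0]})

# shape is an attribute (no parentheses) holding (rows, columns)
rows, cols = toy.shape
print(rows, cols)

# info() prints its summary directly, so it does not need print()
toy.info()

# describe() returns a DataFrame, so individual statistics can be looked up
stats = toy.describe()
print(stats.loc["mean", "age"])
print(stats.loc["max", "oldpeak"])
```

Because `describe()` returns an ordinary DataFrame, you can pull out a single statistic with `.loc[row_label, column_name]` instead of reading it off the printed table.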

### [SECTION 5: Selecting Data]

**Script:** "Once you're familiar with your data, you'll likely want to select specific parts of it for analysis. Pandas makes it easy to select data from your DataFrame, whether you're interested in individual columns, rows, or specific values.

To select a single column, you can use the bracket notation like this: `df['column_name']`. This will return a Series, which is essentially a single column from the DataFrame.

If you want to select multiple columns, you can pass a list of column names: `df[['column1', 'column2']]`. This will return a new DataFrame containing only the specified columns.

You can also select rows by position using the `iloc` indexer. For example, `df.iloc[0:5]` will return the first five rows of the DataFrame. The first number is the starting position, and the second number is the stop position, which is not included, so `0:5` covers positions 0 through 4."

In [8]:
# Select a single column
print(df['age'])

0       52
1       53
2       70
3       61
4       62
        ..
1020    59
1021    60
1022    47
1023    50
1024    54
Name: age, Length: 1025, dtype: int64


In [9]:
# Select multiple columns
print(df[['age', 'sex']])

      age  sex
0      52    1
1      53    1
2      70    1
3      61    1
4      62    0
...   ...  ...
1020   59    1
1021   60    1
1022   47    1
1023   50    0
1024   54    1

[1025 rows x 2 columns]


In [10]:
# Select rows by index
print(df.iloc[0:5])

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   52    1   0       125   212    0        1      168      0      1.0      2   
1   53    1   0       140   203    1        0      155      1      3.1      0   
2   70    1   0       145   174    0        1      125      1      2.6      0   
3   61    1   0       148   203    0        1      161      0      0.0      2   
4   62    0   0       138   294    1        1      106      0      1.9      1   

   ca  thal  target  
0   2     3       0  
1   0     3       0  
2   0     3       0  
3   1     3       0  
4   3     2       0  
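
The difference between the selection styles is easier to see on a small example. The sketch below uses a made-up five-row frame (`toy` is a placeholder name) to show which type each selection returns and how an `iloc` slice treats its two numbers.

```python
import pandas as pd

# Hypothetical five-row frame for illustration
toy = pd.DataFrame({"age": [52, 53, 70, 61, 62], "sex": [1, 1, 1, 1, 0]})

one_col = toy["age"]             # single brackets with one name -> Series
two_cols = toy[["age", "sex"]]   # a list of names -> DataFrame
print(type(one_col).__name__, type(two_cols).__name__)

# iloc slices by position: start is included, stop is excluded,
# so 1:3 selects the rows at positions 1 and 2
middle = toy.iloc[1:3]
print(middle["age"].tolist())
```

Note that `iloc[1:3]` returns two rows, not three: the slice follows the same start-inclusive, stop-exclusive rule as regular Python lists.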


### [SECTION 6: Data Manipulation]

**Script:** "Pandas also provides powerful tools for data manipulation. You can easily add new columns, drop unnecessary ones, and handle missing data with just a few lines of code.

To add a new column, you can simply assign a value to a new column name. For example, `df['new_column'] = df['column1'] + df['column2']` will create a new column called `'new_column'`, where each value is the sum of the corresponding values in `'column1'` and `'column2'`.

If you have columns that you don't need, you can drop them using the `drop()` method. For instance, `df = df.drop('column_name', axis=1)` will remove the specified column from the DataFrame. The `axis=1` parameter indicates that you're dropping a column, as opposed to a row.

Handling missing data is also straightforward with Pandas. You can fill missing values using the `fillna()` method, like this: `df['column_name'] = df['column_name'].fillna(value)`. Here, `value` can be the column's mean, its median, or any constant you choose. Alternatively, if you prefer to remove rows with missing data altogether, you can use the `dropna()` method: `df = df.dropna()`."

**Visual:** Show how the DataFrame changes with each operation.

In [11]:
# Add a new column
df['demo_01'] = df['age'] + df['sex']

In [12]:
# Drop a column
df = df.drop('cp', axis=1)

In [13]:
df.head(10)

   age  sex  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca  thal  target  demo_01
0   52    1       125   212    0        1      168      0      1.0      2   2     3       0       53
1   53    1       140   203    1        0      155      1      3.1      0   0     3       0       54
2   70    1       145   174    0        1      125      1      2.6      0   0     3       0       71
3   61    1       148   203    0        1      161      0      0.0      2   1     3       0       62
4   62    0       138   294    1        1      106      0      1.9      1   3     2       0       62
5   58    0       100   248    0        0      122      0      1.0      1   0     2       1       58
6   58    1       114   318    0        2      140      0      4.4      0   3     1       0       59
7   55    1       160   289    0        0      145      1      0.8      1   1     3       0       56
8   46    1       120   249    0        0      144      0      0.8      2   0     3       0       47
9   54    1       122   286    0        0      116      1      3.2      1   2     2       0       55
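
One detail worth checking for yourself is why the result of `drop()` is assigned back to `df`. The sketch below, on a made-up two-row frame, suggests the reason: `drop()` returns a new DataFrame and leaves the original untouched unless you reassign it. The `toy`, `trimmed`, and `same` names are placeholders for this example.

```python
import pandas as pd

# Hypothetical two-row frame for illustration
toy = pd.DataFrame({"age": [52, 53], "sex": [1, 1], "cp": [0, 0]})

# Assigning to a new column name adds the column in place
toy["demo_01"] = toy["age"] + toy["sex"]

# drop() returns a new DataFrame; toy itself still has the 'cp' column
trimmed = toy.drop("cp", axis=1)
print(toy.columns.tolist())
print(trimmed.columns.tolist())

# The columns= keyword is an equivalent, more explicit spelling of axis=1
same = toy.drop(columns=["cp"])
print(same.equals(trimmed))
```

If you call `df.drop('cp', axis=1)` without assigning the result, the column will still be there the next time you look at `df`.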


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   trestbps  1025 non-null   int64  
 3   chol      1025 non-null   int64  
 4   fbs       1025 non-null   int64  
 5   restecg   1025 non-null   int64  
 6   thalach   1025 non-null   int64  
 7   exang     1025 non-null   int64  
 8   oldpeak   1025 non-null   float64
 9   slope     1025 non-null   int64  
 10  ca        1025 non-null   int64  
 11  thal      1025 non-null   int64  
 12  target    1025 non-null   int64  
 13  demo_01   1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


In [15]:
# Fill missing values
# df['thalach'] = df['thalach'].fillna(0)  # Replace missing values with 0

# df['thalach'] = df['thalach'].fillna(df['thalach'].mean())  # Replace missing values with the mean

# df['thalach'] = df['thalach'].fillna(df['thalach'].median())  # Replace missing values with the median

# df['thalach'] = df['thalach'].fillna(df['thalach'].mode()[0])  # Replace missing values with the mode


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   trestbps  1025 non-null   int64  
 3   chol      1025 non-null   int64  
 4   fbs       1025 non-null   int64  
 5   restecg   1025 non-null   int64  
 6   thalach   1025 non-null   int64  
 7   exang     1025 non-null   int64  
 8   oldpeak   1025 non-null   float64
 9   slope     1025 non-null   int64  
 10  ca        1025 non-null   int64  
 11  thal      1025 non-null   int64  
 12  target    1025 non-null   int64  
 13  demo_01   1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB


In [17]:
# Remove rows with missing values
df = df.dropna()

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1025 non-null   int64  
 1   sex       1025 non-null   int64  
 2   trestbps  1025 non-null   int64  
 3   chol      1025 non-null   int64  
 4   fbs       1025 non-null   int64  
 5   restecg   1025 non-null   int64  
 6   thalach   1025 non-null   int64  
 7   exang     1025 non-null   int64  
 8   oldpeak   1025 non-null   float64
 9   slope     1025 non-null   int64  
 10  ca        1025 non-null   int64  
 11  thal      1025 non-null   int64  
 12  target    1025 non-null   int64  
 13  demo_01   1025 non-null   int64  
dtypes: float64(1), int64(13)
memory usage: 112.2 KB
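
The dataset used here has no missing values (every column reports 1025 non-null entries), so `fillna()` and `dropna()` leave it unchanged. To see them actually do something, here is a sketch on a made-up frame with one missing value; the `toy`, `filled`, and `dropped` names and the numbers are invented for this example.

```python
import pandas as pd

# Hypothetical frame with one missing 'thalach' value (NaN)
toy = pd.DataFrame({
    "thalach": [168.0, float("nan"), 125.0, 161.0],
    "chol": [212, 203, 174, 203],
})

# Count missing values per column before deciding how to handle them
print(toy.isna().sum())

# Option 1: fill the gap with the column mean (mean() skips NaN by default)
filled = toy.copy()
filled["thalach"] = filled["thalach"].fillna(filled["thalach"].mean())
print(filled["thalach"].tolist())

# Option 2: drop every row that contains a missing value
dropped = toy.dropna()
print(len(toy), len(dropped))
```

Filling keeps all four rows, while dropping leaves three. Which one is appropriate depends on how much data you can afford to lose and on whether a filled-in value would distort your analysis.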


### [SECTION 7: Saving Data]

**Script:** "After performing your analysis or data manipulation, you might want to save the results for future use. Pandas makes it easy to save your DataFrame to a file, whether it's a CSV, Excel, or another format.

To save your DataFrame as a CSV file, use the `to_csv()` method. For example, `df.to_csv('new_dataset.csv', index=False)` will save the DataFrame to a file named `'new_dataset.csv'`. The `index=False` parameter ensures that the row indices are not included in the output file.

Similarly, you can save your DataFrame as an Excel file using the `to_excel()` method. For instance, `df.to_excel('new_dataset.xlsx', index=False)` will save the DataFrame to an Excel file. This is particularly useful if you need to share your data with others who may not be using Python."

**Visual:** Show the saved files in a file explorer.

In [19]:
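from pathlib import Path

# Create the output folder first; Pandas does not create missing directories
out_dir = Path('../New Dataset')
out_dir.mkdir(parents=True, exist_ok=True)

# Save the dataframe to a new CSV file
df.to_csv(out_dir / 'new_dataset.csv', index=False)
# Save the dataframe to an Excel file (requires an Excel engine such as openpyxl)
df.to_excel(out_dir / 'new_dataset.xlsx', index=False)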
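
To see what `index=False` changes, you can write a small frame both ways and read the files back. The sketch below uses a temporary directory so it does not touch your project folders; the frame, the folder name, and the file names are made up for this example. It also creates the output folder before saving, since `to_csv()` raises an `OSError` when the target directory does not exist.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame for illustration
toy = pd.DataFrame({"age": [52, 53], "sex": [1, 1]})

with tempfile.TemporaryDirectory() as tmp:
    # to_csv() does not create missing folders, so make the folder first
    out_dir = Path(tmp) / "New Dataset"
    out_dir.mkdir(parents=True, exist_ok=True)

    no_index = out_dir / "no_index.csv"
    with_index = out_dir / "with_index.csv"
    toy.to_csv(no_index, index=False)  # row labels left out
    toy.to_csv(with_index)             # row labels written as a first column

    # Read both files back and compare their column names
    cols_no_index = pd.read_csv(no_index).columns.tolist()
    cols_with_index = pd.read_csv(with_index).columns.tolist()

print(cols_no_index)    # ['age', 'sex']
print(cols_with_index)  # ['Unnamed: 0', 'age', 'sex']
```

Without `index=False`, the row labels are saved as an extra column with no header, and reading the file back gives you a stray `Unnamed: 0` column.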

## [OUTRO]

**Script:** "And that wraps up our introduction to Pandas! Today, we've covered how to load, explore, manipulate, and save data using Pandas. I hope you found this tutorial helpful and that it gives you the confidence to start using Pandas in your own projects.

If you missed the previous videos on setting up your environment, be sure to check them out; they'll give you a great foundation for working with Pandas and other Python libraries. Thanks for watching, and I'll see you in the next video!"

**Visual:** Show the outro slide with social media handles, a call-to-action, and references to previous videos.