Relevant software engineering (SWE) concepts for this project:

- **VS Code**: Code editor for writing and debugging Python scripts locally.
- **Google Colab**: Cloud-based environment for running Python code without local setup.
- **Jupyter Notebooks**: Interactive notebooks for writing and executing Python code cell by cell.
- **GitHub**: Hosting platform for Git repositories, used for collaboration and for storing the project files.
- **Git**: Tool for version control and managing code revisions (commits, branches, merges).
- **Libraries**: External Python modules like Pandas, NumPy, Matplotlib, and Seaborn for data manipulation, computation, and visualization.
- **Functions**: Blocks of reusable code for tasks like data preprocessing, cleaning, or analysis.
- **Terminal**: Command-line interface for executing scripts, installing libraries (via pip), and using Git commands.
- **Kernel**: Process behind a Jupyter or Colab notebook that executes code cells and returns their output.
- **Error messages**: System-provided diagnostics when code fails, aiding debugging and problem-solving.
- **DataFrames**: 2D data structure used in Pandas to store and manipulate tabular data.
- **Type Checking**: Ensuring correct data types for columns in your dataset (like converting to categorical or numeric types).
- **Data Validation**: Process of checking data quality and integrity (handling missing or duplicate values).
- **Pip**: Python package manager for installing libraries.
- **Virtual Environments**: Isolated Python environments to manage project dependencies without conflicts.
- **Version Control**: Managing changes to the codebase with Git (committing, branching, pulling, merging).
- **Data Preprocessing**: Cleaning and preparing raw data for analysis (handling missing data, normalizing values).
- **Binning**: Grouping continuous data into discrete intervals (e.g., creating age groups for survival analysis).
- **Visualization**: Creating plots (like bar charts, scatter plots) to represent data using libraries like Matplotlib and Seaborn.
- **Loops & Conditionals**: Basic control structures for iterating over data and making decisions in code.
- **Data Wrangling**: Transforming and mapping raw data into a format more suitable for analysis.
- **CSV Handling**: Reading from and writing to CSV files using Pandas.
- **Project Structure**: Organizing code, data, and results in a logical folder and file structure.
- **Unit Testing**: Writing small tests to ensure that individual code components (e.g., functions) work as expected.
- **Documentation**: Writing clear comments and markdown cells in Jupyter Notebooks to explain code logic.
- **Collaboration Tools**: Using GitHub for team collaboration, pull requests, and code reviews.
- **IDE Shortcuts**: Navigating efficiently within VS Code or Colab using keyboard shortcuts and extensions.
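
Several of the concepts above (functions, data validation, unit testing) fit together in a few lines. The following is a minimal sketch, not part of the project itself: `check_required_columns` is a hypothetical helper, and the toy DataFrame holds made-up values.

```python
import pandas as pd


def check_required_columns(df: pd.DataFrame, required: list[str]) -> list[str]:
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


def test_check_required_columns() -> None:
    """Small unit test on made-up data, not the project's dataset."""
    toy = pd.DataFrame({"Age": [34.5, 47.0], "Sex": ["male", "female"]})
    assert check_required_columns(toy, ["Age", "Sex"]) == []
    assert check_required_columns(toy, ["Age", "Fare"]) == ["Fare"]


# Run the test directly; a test runner such as pytest would also discover it by name.
test_check_required_columns()
```

Keeping the check in a function makes it reusable across notebooks and lets a test pin down its behavior before it is applied to real data.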



In [3]:
# %pip install pandas
%pip install seaborn

Defaulting to user installation because normal site-packages is not writeable
Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
Collecting matplotlib!=3.6.1,>=3.4
  Downloading matplotlib-3.9.2-cp39-cp39-macosx_11_0_arm64.whl (7.8 MB)
Collecting importlib-resources>=3.2.0
  Downloading importlib_resources-6.4.5-py3-none-any.whl (36 kB)
Collecting kiwisolver>=1.3.1
  Downloading kiwisolver-1.4.7-cp39-cp39-macosx_11_0_arm64.whl (64 kB)
Collecting cycler>=0.10
  Downloading cycler-0.12.1-py3-none-any.whl (8.3 kB)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.53.1-cp39-cp39-macosx_11_0_arm64.whl (2.2 MB)
Collecting pillow>=8
  Downloading pillow-10.4.0-cp39-cp39-maco

In [4]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.ticker import StrMethodFormatter

Matplotlib is building the font cache; this may take a moment.


Special character

\ $ ?

In [71]:
data = pd.read_csv("./tested.csv")
print(data.head(5))
data.tail(4)

   PassengerId  Survived  Pclass  \
0          892         0       3   
1          893         1       3   
2          894         0       2   
3          895         0       3   
4          896         1       3   

                                           Name     Sex   Age  SibSp  Parch  \
0                              Kelly, Mr. James    male  34.5      0      0   
1              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   
2                     Myles, Mr. Thomas Francis    male  62.0      0      0   
3                              Wirz, Mr. Albert    male  27.0      0      0   
4  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1   

    Ticket     Fare Cabin Embarked  
0   330911   7.8292   NaN        Q  
1   363272   7.0000   NaN        S  
2   240276   9.6875   NaN        Q  
3   315154   8.6625   NaN        S  
4  3101298  12.2875   NaN        S  


     PassengerId  Survived  Pclass                          Name     Sex   Age  SibSp  Parch              Ticket     Fare  Cabin  Embarked
414         1306         1       1  Oliva y Ocana, Dona. Fermina  female  39.0      0      0            PC 17758    108.9   C105         C
415         1307         0       3  Saether, Mr. Simon Sivertsen    male  38.5      0      0  SOTON/O.Q. 3101262     7.25    NaN         S
416         1308         0       3           Ware, Mr. Frederick    male   NaN      0      0              359309     8.05    NaN         S
417         1309         0       3      Peter, Master. Michael J    male   NaN      1      1                2668  22.3583    NaN         C


In [76]:
data.info()
# data.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     418 non-null    int64  
 2   Pclass       418 non-null    int64  
 3   Name         418 non-null    object 
 4   Sex          418 non-null    object 
 5   Age          332 non-null    float64
 6   SibSp        418 non-null    int64  
 7   Parch        418 non-null    int64  
 8   Ticket       418 non-null    object 
 9   Fare         417 non-null    float64
 10  Cabin        91 non-null     object 
 11  Embarked     418 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [75]:
# Total missing values per data.info(): Cabin (327) + Age (86) + Fare (1)
327+86+1

414
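
The sum above appears to tally missing values by hand from the `info()` output (418 rows minus the non-null counts for Cabin, Age, and Fare). A minimal sketch of letting pandas do the same count, using a made-up toy frame rather than the real file:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame(
    {
        "Age": [34.5, np.nan, 62.0, np.nan],
        "Fare": [7.83, 7.00, np.nan, 8.66],
        "Cabin": [np.nan, np.nan, "C105", np.nan],
    }
)

missing_per_column = toy.isna().sum()          # missing count for each column
total_missing = int(missing_per_column.sum())  # grand total across all columns

print(missing_per_column)
print(total_missing)
```

On the notebook's own DataFrame the equivalent call would be `data.isna().sum()`, which is the same check as the commented-out `data.isnull().sum()` in the previous cell.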

In [None]:
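data.bfill()    # backward fill: copy the next valid value into each gap
data.ffill()    # forward fill: copy the previous valid value into each gap
# fillna() needs an explicit value; the Age median is only an illustrative choice
data.fillna({'Age': data['Age'].median()})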
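
The three fill methods in the cell above differ in where the replacement value comes from. A small sketch on a made-up Series shows the difference; the median as a constant fill value is an assumption chosen for illustration, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 30.0, np.nan])  # made-up values

forward = ages.ffill()                    # carry the last valid value forward
backward = ages.bfill()                   # pull the next valid value backward
with_median = ages.fillna(ages.median())  # replace every gap with one constant

print(forward.tolist())
print(backward.tolist())
print(with_median.tolist())
```

Note that `bfill` leaves a trailing gap unfilled because there is no later value to copy, and that all three calls return new objects rather than modifying `ages` in place.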

In [39]:
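# Drop exact duplicate rows (the method must be called; a bare `data.drop_duplicates` does nothing)
data = data.drop_duplicates()
# Confirm that no duplicated rows remain
data.duplicated().value_counts()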

False    418
Name: count, dtype: int64

In [42]:
data.describe(include='all')

        PassengerId  Survived    Pclass              Name   Sex        Age     SibSp     Parch    Ticket       Fare            Cabin  Embarked
count         418.0     418.0     418.0               418   418      332.0     418.0     418.0       418      417.0               91       418
unique          NaN       NaN       NaN               418     2        NaN       NaN       NaN       363        NaN               76         3
top             NaN       NaN       NaN  Kelly, Mr. James  male        NaN       NaN       NaN  PC 17608        NaN  B57 B59 B63 B66         S
freq            NaN       NaN       NaN                 1   266        NaN       NaN       NaN         5        NaN                3       270
mean         1100.5  0.363636   2.26555               NaN   NaN   30.27259  0.447368  0.392344       NaN  35.627188              NaN       NaN
std      120.810458  0.481622  0.841838               NaN   NaN  14.181209   0.89676  0.981429       NaN  55.907576              NaN       NaN
min           892.0       0.0       1.0               NaN   NaN       0.17       0.0       0.0       NaN        0.0              NaN       NaN
25%          996.25       0.0       1.0               NaN   NaN       21.0       0.0       0.0       NaN     7.8958              NaN       NaN
50%          1100.5       0.0       3.0               NaN   NaN       27.0       0.0       0.0       NaN    14.4542              NaN       NaN
75%         1204.75       1.0       3.0               NaN   NaN       39.0       1.0       0.0       NaN       31.5              NaN       NaN


In [48]:
data.groupby(['Pclass', 'Sex'])[['Embarked', 'Survived']].value_counts()

Pclass  Sex     Embarked  Survived
1       female  C         1            28
                S         1            21
                Q         1             1
        male    S         0            29
                C         0            28
2       female  S         1            26
                C         1             4
        male    S         0            52
                C         0             7
                Q         0             4
3       female  S         1            41
                Q         1            23
                C         1             8
        male    S         0           101
                C         0            27
                Q         0            18
Name: count, dtype: int64

In [51]:
data.groupby('Survived')['Age'].count()

Survived
0    205
1    127
Name: Age, dtype: int64

In [57]:
type(True)

bool

In [53]:
data.Sex.value_counts()

Sex
male      266
female    152
Name: count, dtype: int64

In [69]:
age_bins = [0, 18, 40, 60, 100]  # Bin edges; with right=False the intervals are [0, 18), [18, 40), [40, 60), [60, 100)
age_labels = ['<18', '18-40', '40-60', '60+']

# Create a new column 'AgeGroup' by binning the 'Age' column
data['AgeGroup'] = pd.cut(data['Age'], bins=age_bins, labels=age_labels, right=False)

# Calculate the survival rate for each age group
# observed=False keeps every age category and avoids the pandas FutureWarning on categorical groupers
age_group_survival = data.groupby('AgeGroup', observed=False)['Survived'].mean()

# Display the survival rates for each age group
age_group_survival


AgeGroup
<18      0.414634
18-40    0.382775
40-60    0.338235
60+      0.500000
Name: Survived, dtype: float64
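
Because the binning cell passes `right=False`, each interval includes its left edge and excludes its right edge. A short sketch with made-up boundary ages (not values from the dataset) shows where edge cases land:

```python
import pandas as pd

age_bins = [0, 18, 40, 60, 100]
age_labels = ['<18', '18-40', '40-60', '60+']

# Made-up ages sitting exactly on or next to the bin edges
ages = pd.Series([17.9, 18.0, 40.0, 60.0, 100.0])
groups = pd.cut(ages, bins=age_bins, labels=age_labels, right=False)

print(groups.tolist())
```

An age of exactly 18 falls in `18-40`, and a value equal to the last edge (100 here) is left unbinned as NaN, so the final edge should sit above the largest age in the data.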

In [70]:
# Group by 'Pclass' and 'Sex' and calculate the mean survival rate for each group
gender_class_survival = data.groupby(['Pclass', 'Sex'])['Survived'].mean()

# Display the survival rates for each gender and class combination
gender_class_survival


Pclass  Sex   
1       female    1.0
        male      0.0
2       female    1.0
        male      0.0
3       female    1.0
        male      0.0
Name: Survived, dtype: float64
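
Rates of exactly 1.0 and 0.0 in every class suggest that `Survived` is fully determined by `Sex` in this file, which matches the earlier `value_counts()` output. One way to check this directly is a cross-tabulation; the sketch below uses a made-up toy frame with the same pattern rather than the real data.

```python
import pandas as pd

# Toy frame mimicking the pattern seen above (values are made up)
toy = pd.DataFrame(
    {
        "Sex": ["female", "female", "male", "male", "male"],
        "Survived": [1, 1, 0, 0, 0],
    }
)

# Rows are Sex, columns are Survived; each cell is a passenger count
counts = pd.crosstab(toy["Sex"], toy["Survived"])
print(counts)
```

If the off-diagonal cells are all zero on the real data (`pd.crosstab(data['Sex'], data['Survived'])`), any survival analysis by class or age on this file mostly reflects the sex split, which is worth keeping in mind when reading the group means.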
