In [1]:
import pandas as pd  
import numpy as np 
import warnings 
import re

pd.set_option('display.float_format', lambda x: '%.2f' % x) 
print("......Libraries Loaded Successfully.........")

......Libraries Loaded Successfully.........


In [7]:
pd.set_option('display.max_columns', 10) 
filepath = "./Datasets/Lung Cancer Survey.csv"
df = pd.read_csv(filepath)
df.sample(10)

Unnamed: 0,GENDER,AGE,SMOKING,YELLOW_FINGERS,ANXIETY,...,COUGHING,SHORTNESS OF BREATH,SWALLOWING DIFFICULTY,CHEST PAIN,LUNG_CANCER
154,Female,64,YES,YES,NO,...,NO,NO,NO,NO,NO
182,Male,71,NO,NO,NO,...,YES,YES,NO,YES,YES
156,Male,47,YES,YES,NO,...,NO,NO,NO,YES,NO
135,Male,69,NO,NO,YES,...,YES,YES,YES,NO,YES
259,Male,58,NO,NO,NO,...,NO,NO,YES,NO,YES
19,Female,61,NO,NO,NO,...,NO,YES,NO,NO,NO
258,Male,70,YES,NO,YES,...,YES,YES,NO,YES,YES
145,Female,65,YES,YES,YES,...,YES,YES,YES,NO,YES
51,Male,63,YES,YES,YES,...,NO,YES,NO,NO,YES
218,Female,70,NO,NO,NO,...,YES,YES,NO,NO,YES


In [8]:
df.shape

(309, 16)

In [9]:
def rename_column(text):                      
    text = re.sub(r'[^\w\s]', '_', text)
    text = text.title()
    return text

df.columns = df.columns.to_series().apply(rename_column)
for column in df.columns.tolist(): print(f"{'-'*25} {column}")

------------------------- Gender
------------------------- Age
------------------------- Smoking
------------------------- Yellow_Fingers
------------------------- Anxiety
------------------------- Peer_Pressure
------------------------- Chronic Disease
------------------------- Fatigue 
------------------------- Allergy 
------------------------- Wheezing
------------------------- Alcohol Consuming
------------------------- Coughing
------------------------- Shortness Of Breath
------------------------- Swallowing Difficulty
------------------------- Chest Pain
------------------------- Lung_Cancer
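
The listing above mixes underscores and spaces, and some names appear to carry trailing whitespace. As a minimal sketch, assuming a fully underscore-separated naming scheme is wanted, a stricter normalizer could look like the following. The helper `normalize_column` and the `raw_columns` list are hypothetical and only illustrate the idea; they are not part of the original notebook.

```python
import re

def normalize_column(name):
    """Hypothetical helper: trim, replace punctuation and spaces with '_', title-case."""
    name = name.strip()                      # drop leading/trailing whitespace
    name = re.sub(r"[^\w\s]", "_", name)     # punctuation -> underscore
    name = re.sub(r"\s+", "_", name)         # runs of whitespace -> single underscore
    return name.title()

# Illustrative raw headers in the style of the survey file.
raw_columns = ["YELLOW_FINGERS", "CHRONIC DISEASE", "FATIGUE ", "SHORTNESS OF BREATH"]
clean_columns = [normalize_column(c) for c in raw_columns]
print(clean_columns)
```

Consistent names without spaces are easier to reference later, since a stray trailing space in a name such as `'Fatigue '` makes `df['Fatigue']` raise a `KeyError`.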


<h4 style='font-size: 18px; color: blue; font-family: Colonna MT; font-weight: 600'>4.3: Handling Duplicate Values</h4>

In [13]:
def HandlingDuplicates(df):
    # Count rows that exactly repeat an earlier row, then drop them in place.
    Duplicates = df.duplicated().sum()
    if Duplicates != 0:
        df.drop_duplicates(inplace=True)
        print(f'Dataset had {Duplicates} duplicate rows; they were removed successfully.')
    else:
        print('Dataset has no duplicate rows.')
HandlingDuplicates(df)

Dataset had 33 duplicate rows; they were removed successfully.
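
Before dropping duplicates, it can help to look at which rows repeat. The sketch below uses a small made-up frame (`toy`, not the survey data) and `duplicated(keep=False)` to flag every copy of a repeated row. In a survey made up mostly of binary answers, identical rows may belong to different respondents rather than being data-entry errors, so whether to drop them is a judgment call.

```python
import pandas as pd

# Toy data standing in for the survey; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Age": [64, 58, 64, 70],
    "Smoking": ["YES", "NO", "YES", "YES"],
})

# keep=False marks every member of a duplicate group, not just the later copies.
all_copies = toy[toy.duplicated(keep=False)]

# With the default keep='first', duplicated() flags only the extra copies,
# which is the number of rows drop_duplicates() removes.
n_extra = int(toy.duplicated().sum())

deduped = toy.drop_duplicates().reset_index(drop=True)
print(all_copies)
print(f"{n_extra} extra row(s) removed, {len(deduped)} rows remain")
```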


<h4 style='font-size: 18px; color: blue; font-family: Candara; font-weight: 600'>1.4: Column Summary</h4>

Before any analysis, summarize the structure of the dataset column by column. The **data type (dtype)** of each column shows whether it holds numerical or categorical values, which determines the analytical techniques that apply. The **number of unique values** separates continuous features from discrete ones, and the **distinct values** themselves, with their counts, show how each variable is distributed.

**Missing values** reveal gaps that may need handling, and the **count of non-null entries** confirms how complete each column is. Together these checks indicate whether preprocessing steps such as imputation or cleaning are necessary.

In [10]:
def column_summary(df):
    """Summarize dtype, null counts, and most frequent values for each column."""
    summary_data = []
    for col_name in df.columns:
        col = df[col_name]
        # value_counts() is already sorted by descending frequency, so head(10)
        # keeps every value for low-cardinality columns and the ten most
        # frequent values otherwise.
        distinct_values_counts = col.value_counts().head(10).to_dict()

        summary_data.append({
            'col_name': col_name,
            'col_dtype': col.dtype,
            'num_of_nulls': col.isnull().sum(),
            'num_of_non_nulls': col.notnull().sum(),
            'num_of_distinct_values': col.nunique(),
            'distinct_values_counts': distinct_values_counts
        })

    return pd.DataFrame(summary_data)

summary_df = column_summary(df)
display(summary_df)

Unnamed: 0,col_name,col_dtype,num_of_nulls,num_of_non_nulls,num_of_distinct_values,distinct_values_counts
0,Gender,object,0,309,2,"{'Male': 162, 'Female': 147}"
1,Age,int64,0,309,39,"{64: 20, 63: 19, 56: 19, 62: 18, 60: 17, 61: 1..."
2,Smoking,object,0,309,2,"{'YES': 174, 'NO': 135}"
3,Yellow_Fingers,object,0,309,2,"{'YES': 176, 'NO': 133}"
4,Anxiety,object,0,309,2,"{'NO': 155, 'YES': 154}"
5,Peer_Pressure,object,0,309,2,"{'YES': 155, 'NO': 154}"
6,Chronic Disease,object,0,309,2,"{'YES': 156, 'NO': 153}"
7,Fatigue,object,0,309,2,"{'YES': 208, 'NO': 101}"
8,Allergy,object,0,309,2,"{'YES': 172, 'NO': 137}"
9,Wheezing,object,0,309,2,"{'YES': 172, 'NO': 137}"
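
For comparison, most of this summary can also be assembled from vectorized pandas calls without an explicit loop. The following is a minimal sketch on a made-up frame (`toy` and `compact_summary` are illustrative names, not part of the notebook); it omits the per-value counts, which still need `value_counts()` per column.

```python
import pandas as pd

# Toy data with one missing entry; values are illustrative only.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", None],
    "Age": [64, 58, 64, 70],
})

# Each right-hand side is a Series indexed by column name, so they align
# into one row per column of the original frame.
compact_summary = pd.DataFrame({
    "col_dtype": toy.dtypes.astype(str),
    "num_of_nulls": toy.isna().sum(),
    "num_of_non_nulls": toy.notna().sum(),
    "num_of_distinct_values": toy.nunique(),
})
print(compact_summary)
```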


<h4 style='font-size: 18px; color: blue;  font-family: Colonna MT; font-weight: 600'>4.6: Exploring Invalid Entries and Dtypes</h4>

Exploring invalid entries in data types involves identifying values that do not match the expected format or category within each column. This includes detecting inconsistencies such as numerical values in categorical fields, incorrect data formats, or unexpected symbols and typos. Invalid entries can lead to errors in analysis and model performance, making it essential to standardize data types and correct anomalies.

In [11]:
def simplify_dtype(dtype):
    # Map the Python type of a single cell to a coarse category.
    # issubclass() also matches NumPy scalar types such as np.int64.
    if dtype is type(None): return 'Missing'
    elif issubclass(dtype, (np.datetime64, pd.Timestamp)): return 'Datetime'
    elif issubclass(dtype, (int, float, np.number)) and not issubclass(dtype, bool): return 'Numeric'
    elif issubclass(dtype, str): return 'String'
    else: return 'Other'

def analyze_column_dtypes(df):
    # A fixed column order keeps the report layout stable between runs.
    all_dtypes = ['Numeric', 'Datetime', 'Other', 'String', 'Missing']
    results = pd.DataFrame('-', index=df.columns, columns=all_dtypes)

    for column in df.columns:
        # Classify each cell by type; NaN is a float, so flag nulls explicitly.
        kinds = df[column].apply(lambda x: simplify_dtype(type(x)))
        kinds = kinds.mask(df[column].isna(), 'Missing')
        percentages = kinds.value_counts() / len(df) * 100
        for dtype, percent in percentages.items():
            results.at[column, dtype] = f'{percent:.2f}%'  # categories with no cells keep '-'
    return results

results = analyze_column_dtypes(df)
display(results)

Unnamed: 0,Numeric,Datetime,Other,String,Missing
Gender,-,-,-,100.00%,-
Age,100.00%,-,-,-,-
Smoking,-,-,-,100.00%,-
Yellow_Fingers,-,-,-,100.00%,-
Anxiety,-,-,-,100.00%,-
Peer_Pressure,-,-,-,100.00%,-
Chronic Disease,-,-,-,100.00%,-
Fatigue,-,-,-,100.00%,-
Allergy,-,-,-,100.00%,-
Wheezing,-,-,-,100.00%,-
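
A type check alone does not catch typos or unexpected labels inside string columns. As a minimal sketch, assuming the binary columns should contain only `YES` and `NO`, the values that fall outside that set can be listed per column. The frame `toy` and the names `expected` and `unexpected` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

# Toy data with two deliberately invalid entries in "Smoking".
toy = pd.DataFrame({
    "Smoking": ["YES", "NO", "yes ", "2"],
    "Anxiety": ["NO", "NO", "YES", "YES"],
})

expected = {"YES", "NO"}  # assumed valid labels for the binary columns

# For each column, collect the values that fall outside the expected set.
unexpected = {
    col: sorted(set(toy[col].dropna().unique()) - expected)
    for col in toy.columns
}
# Keep only the columns that actually contain invalid entries.
unexpected = {col: vals for col, vals in unexpected.items() if vals}
print(unexpected)
```

Entries such as `'yes '` can usually be repaired with `str.strip().str.upper()`, while a value like `'2'` needs a decision about what it was meant to encode.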


<h4 style='font-size: 18px; color: Blue; font-family: Colonna MT; font-weight: 600'>4.4: Checking Missing Values</h4>

Checking for missing values is a crucial step in data analysis to assess the completeness and reliability of the dataset. This involves identifying any columns with null or empty entries, which may affect the accuracy of statistical and machine learning models.

In [12]:
def missing_values_info(df):
    # Count nulls per column and express them as a share of all rows.
    isna_df = df.isna().sum().reset_index(name='Missing Values Counts')
    isna_df['Proportions (%)'] = isna_df['Missing Values Counts']/len(df)*100
    return isna_df

isna_df = missing_values_info(df)
isna_df

Unnamed: 0,index,Missing Values Counts,Proportions (%)
0,Gender,0,0.0
1,Age,0,0.0
2,Smoking,0,0.0
3,Yellow_Fingers,0,0.0
4,Anxiety,0,0.0
5,Peer_Pressure,0,0.0
6,Chronic Disease,0,0.0
7,Fatigue,0,0.0
8,Allergy,0,0.0
9,Wheezing,0,0.0
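
When a dataset does contain gaps, the same proportions can be computed in one step, because the mean of a boolean mask is the fraction of `True` values. The sketch below runs on a made-up frame (`toy`, `missing_pct`, and `cols_with_gaps` are illustrative names) and keeps only the columns that have missing entries, sorted from most to least affected.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; values are illustrative only.
toy = pd.DataFrame({
    "Age": [64, np.nan, 58, 70],
    "Smoking": ["YES", "NO", None, None],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# isna() gives a boolean frame; its column means are the missing fractions.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)

# Keep only the columns that need attention.
cols_with_gaps = missing_pct[missing_pct > 0]
print(cols_with_gaps)
```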


---

This analysis was performed by **Jabulente**, a passionate and dedicated data scientist with a strong commitment to using data to drive meaningful insights and solutions. For inquiries, collaborations, or further discussions, please feel free to reach out via the links below.

    
<div align="center">  
    
[![GitHub](https://img.shields.io/badge/GitHub-Jabulente-black?logo=github)](https://github.com/Jabulente)  [![LinkedIn](https://img.shields.io/badge/LinkedIn-Jabulente-blue?logo=linkedin)](https://linkedin.com/in/jabulente-208019349)  [![Email](https://img.shields.io/badge/Email-jabulente@hotmail.com-red?logo=gmail)](mailto:Jabulente@hotmail.com)  

</div>
