Python code for data loading and analysis
Creating the data dictionary

In [19]:
import pandas as pd
# Load the dataset
df = pd.read_csv('Telco-Customer-Churn.csv')
# Get data structure overview
data_dictionary = pd.DataFrame({
    "Attribute Name": df.columns,
    "Data Type": df.dtypes.values,
    "Missing Values": df.isnull().sum().values,
    "Unique Values": df.nunique().values
})
# Display the data dictionary in the Jupyter notebook
from IPython.display import display
display(data_dictionary)

    Attribute Name  Data Type  Missing Values  Unique Values
0       customerID     object               0           7043
1           gender     object               0              2
2    SeniorCitizen      int64               0              2
3          Partner     object               0              2
4       Dependents     object               0              2
5           tenure      int64               0             73
6     PhoneService     object               0              2
7    MultipleLines     object               0              3
8  InternetService     object               0              3
9   OnlineSecurity     object               0              3


Data Ingestion - with Pandas

In [20]:
#import pandas as pd

# Load the dataset into a Pandas DataFrame
df = pd.read_csv('Telco-Customer-Churn.csv')
# Display first few rows
df.head(10)


,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes
5,9305-CDSKC,Female,0,No,No,8,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,Month-to-month,Yes,Electronic check,99.65,820.5,Yes
6,1452-KIOVK,Male,0,No,Yes,22,Yes,Yes,Fiber optic,No,...,No,No,Yes,No,Month-to-month,Yes,Credit card (automatic),89.1,1949.4,No
7,6713-OKOMC,Female,0,No,No,10,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,No,Mailed check,29.75,301.9,No
8,7892-POOKP,Female,0,Yes,No,28,Yes,Yes,Fiber optic,No,...,Yes,Yes,Yes,Yes,Month-to-month,Yes,Electronic check,104.8,3046.05,Yes
9,6388-TABGU,Male,0,No,Yes,62,Yes,No,DSL,Yes,...,No,No,No,No,One year,No,Bank transfer (automatic),56.15,3487.95,No


Checking for missing values using Pandas

In [22]:
missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)

Missing Values:
 customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


Statistical summary of numeric columns - helps in understanding the range and distribution of numerical columns

In [24]:
numeric_summary = df.describe()
print("Statistical Summary:\n", numeric_summary)

Statistical Summary:
        SeniorCitizen       tenure  MonthlyCharges
count    7043.000000  7043.000000     7043.000000
mean        0.162147    32.371149       64.761692
std         0.368612    24.559481       30.090047
min         0.000000     0.000000       18.250000
25%         0.000000     9.000000       35.500000
50%         0.000000    29.000000       70.350000
75%         0.000000    55.000000       89.850000
max         1.000000    72.000000      118.750000


Check for duplicate records - to ensure that there is no redundant data

In [25]:
duplicate_count = df.duplicated().sum()
print("\nTotal Duplicate Records:", duplicate_count)


Total Duplicate Records: 0


Check Data Types and Unique Values - to identify categorical vs numerical columns and spot inconsistencies

In [31]:
data_info = pd.DataFrame({
    "Data Type": df.dtypes,
    "Unique Values": df.nunique()
})

print("\nData Types and Unique Values:\n", data_info)


Data Types and Unique Values:
                  Data Type  Unique Values
customerID          object           7043
gender              object              2
SeniorCitizen        int64              2
Partner             object              2
Dependents          object              2
tenure               int64             73
PhoneService        object              2
MultipleLines       object              3
InternetService     object              3
OnlineSecurity      object              3
OnlineBackup        object              3
DeviceProtection    object              3
TechSupport         object              3
StreamingTV         object              3
StreamingMovies     object              3
Contract            object              3
PaperlessBilling    object              2
PaymentMethod       object              4
MonthlyCharges     float64           1585
TotalCharges       float64           6530
Churn               object              2


In [28]:
print("\n Data Type of 'TotalCharges':", df["TotalCharges"].dtype)
invalid_values = df[~df["TotalCharges"].str.replace(' ', '').str.isnumeric()]
print("\n Non-Numeric Values in 'TotalCharges':")
print(invalid_values[["customerID", "TotalCharges"]].head(10))
invalid_count = invalid_values.shape[0]
print(f"\n Found {invalid_count} rows with non-numeric values in 'TotalCharges'!")


 Data Type of 'TotalCharges': object

 Non-Numeric Values in 'TotalCharges':
   customerID TotalCharges
0  7590-VHVEG        29.85
1  5575-GNVDE       1889.5
2  3668-QPYBK       108.15
3  7795-CFOCW      1840.75
4  9237-HQITU       151.65
5  9305-CDSKC        820.5
6  1452-KIOVK       1949.4
7  6713-OKOMC        301.9
8  7892-POOKP      3046.05
9  6388-TABGU      3487.95

 Found 6719 rows with non-numeric values in 'TotalCharges'!
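
The count above is misleading: `str.isnumeric()` is True only when every character is a digit-like character, so any value with a decimal point (for example '29.85') is reported as non-numeric. A minimal sketch of a stricter check, written in plain Python on a small hypothetical sample rather than the real file; `is_parseable_number` is an illustrative helper and not part of the notebook:

```python
def is_parseable_number(text):
    """Return True if the string can be converted to a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False

# Hypothetical sample of raw TotalCharges strings (not taken from the dataset).
sample = ["29.85", "1889.5", " ", "", "108"]

# The notebook's check: flags every value that contains a decimal point.
flagged_by_isnumeric = [s for s in sample if not s.replace(" ", "").isnumeric()]

# Parse-based check: flags only values that cannot be read as numbers.
flagged_by_parse = [s for s in sample if not is_parseable_number(s)]

print(flagged_by_isnumeric)
print(flagged_by_parse)
```

In pandas, the same idea can be expressed as `pd.to_numeric(df["TotalCharges"], errors="coerce").isna()`, which marks only the values that fail to parse.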


The data before and after cleaning

In [30]:
print("Before Cleaning:")
print(df["TotalCharges"].head(10))
# Convert 'TotalCharges' to a numeric (float) column;
# blank (' ') or empty ('') values that cannot be parsed become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors='coerce')

# Display the cleaned 'TotalCharges' column
print("\nAfter Cleaning:")
print(df["TotalCharges"].head(10))

# Save the cleaned data back to a CSV file
cleaned_file_path = "Telco-Customer-Churn-Cleaned.csv"
df.to_csv(cleaned_file_path, index=False)

print("\n✅ Data Cleaning Completed and Saved to:", cleaned_file_path)

Before Cleaning:
0      29.85
1    1889.50
2     108.15
3    1840.75
4     151.65
5     820.50
6    1949.40
7     301.90
8    3046.05
9    3487.95
Name: TotalCharges, dtype: float64

After Cleaning:
0      29.85
1    1889.50
2     108.15
3    1840.75
4     151.65
5     820.50
6    1949.40
7     301.90
8    3046.05
9    3487.95
Name: TotalCharges, dtype: float64

✅ Data Cleaning Completed and Saved to: Telco-Customer-Churn-Cleaned.csv


Achieving Data Analysis and Data Cleaning using SQL

In [23]:

#import pandas as pd
from sqlalchemy import create_engine
import psycopg2

# Load the dataset into a Pandas DataFrame
df = pd.read_csv('Telco-Customer-Churn.csv')

# Define PostgreSQL connection using default credentials
db_engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')

# Load DataFrame into PostgreSQL (Table: telco_customer_churn)
df.to_sql('telco_customer_churn', db_engine, if_exists='replace', index=False)

print("✅ Data successfully loaded into PostgreSQL database!")

# Connect to PostgreSQL to execute a query
conn = None
try:
    conn = psycopg2.connect("dbname='postgres' user='postgres' password='postgres' host='localhost' port='5432'")
    cursor = conn.cursor()

    # Execute SQL query to fetch first 5 rows
    query = "SELECT * FROM telco_customer_churn LIMIT 5;"
    cursor.execute(query)

    # Fetch and display results
    rows = cursor.fetchall()
    print("\n📊 First 5 Rows from telco_customer_churn Table:")
    for row in rows:
        print(row)

    cursor.close()
except Exception as e:
    print("❌ Error connecting to PostgreSQL:", e)
finally:
    # Close the connection even if the query fails
    if conn is not None:
        conn.close()


✅ Data successfully loaded into PostgreSQL database!

📊 First 5 Rows from telco_customer_churn Table:
('7590-VHVEG', 'Female', 0, 'Yes', 'No', 1, 'No', 'No phone service', 'DSL', 'No', 'Yes', 'No', 'No', 'No', 'No', 'Month-to-month', 'Yes', 'Electronic check', 29.85, '29.85', 'No')
('5575-GNVDE', 'Male', 0, 'No', 'No', 34, 'Yes', 'No', 'DSL', 'Yes', 'No', 'Yes', 'No', 'No', 'No', 'One year', 'No', 'Mailed check', 56.95, '1889.5', 'No')
('3668-QPYBK', 'Male', 0, 'No', 'No', 2, 'Yes', 'No', 'DSL', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'Month-to-month', 'Yes', 'Mailed check', 53.85, '108.15', 'Yes')
('7795-CFOCW', 'Male', 0, 'No', 'No', 45, 'No', 'No phone service', 'DSL', 'Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'One year', 'No', 'Bank transfer (automatic)', 42.3, '1840.75', 'No')
('9237-HQITU', 'Female', 0, 'No', 'No', 2, 'Yes', 'No', 'Fiber optic', 'No', 'No', 'No', 'No', 'No', 'No', 'Month-to-month', 'Yes', 'Electronic check', 70.7, '151.65', 'Yes')
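
The cell above hardcodes the default `postgres` credentials in the connection string. One possible alternative, sketched here under the assumption that the standard libpq environment variables (`PGUSER`, `PGPASSWORD`, `PGHOST`, `PGPORT`, `PGDATABASE`) are set, is to build the URL from the environment; `build_postgres_url` is a hypothetical helper, and the fallbacks mirror the defaults used in this notebook:

```python
import os
from urllib.parse import quote_plus

def build_postgres_url(env=os.environ):
    """Build a PostgreSQL URL from environment variables, with local defaults."""
    user = env.get("PGUSER", "postgres")
    password = env.get("PGPASSWORD", "postgres")
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "postgres")
    # quote_plus escapes characters such as '@' or ':' that would break the URL.
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"

# With no variables set, the result matches the URL used in the notebook.
print(build_postgres_url({}))
```

The returned string could then be passed to `create_engine` in place of the literal URL.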


Finding missing values using SQL

In [26]:
import pandas as pd
from sqlalchemy import create_engine
import psycopg2

# Load the dataset into a Pandas DataFrame
df = pd.read_csv("Telco-Customer-Churn.csv")

# Define PostgreSQL connection using default credentials
db_engine = create_engine('postgresql://postgres:postgres@localhost:5432/postgres')

# Load DataFrame into PostgreSQL (Table: telco_customer_churn)
df.to_sql('telco_customer_churn', db_engine, if_exists='replace', index=False)

print("✅ Data successfully loaded into PostgreSQL database!")

missing_values = df.isnull().sum()
print("Missing Values:\n", missing_values)
# Convert missing values count to a DataFrame for better visualization
missing_df = missing_values.to_frame(name="Missing Count")
display(missing_df)


✅ Data successfully loaded into PostgreSQL database!
Missing Values:
 customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64


                 Missing Count
customerID                   0
gender                       0
SeniorCitizen                0
Partner                      0
Dependents                   0
tenure                       0
PhoneService                 0
MultipleLines                0
InternetService              0
OnlineSecurity               0
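
The cell above still counts missing values with pandas `isnull()`, which does not treat blank strings as missing. A SQL version of the check could look like the sketch below. It uses an in-memory SQLite database as a stand-in for PostgreSQL so that it runs without a server (an assumption made for illustration), and the rows are hypothetical. The query itself uses only `COUNT`, `CASE`, and `TRIM`, so it should carry over to the PostgreSQL table while `TotalCharges` is still stored as text.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE telco_customer_churn ("customerID" TEXT, "TotalCharges" TEXT)')

# Hypothetical rows: one NULL and one blank TotalCharges value.
conn.executemany(
    "INSERT INTO telco_customer_churn VALUES (?, ?)",
    [("A-1", "29.85"), ("A-2", None), ("A-3", " "), ("A-4", "108.15")],
)

# COUNT(column) skips NULLs, so COUNT(*) - COUNT(column) is the NULL count.
# Blank strings are not NULL and need a separate TRIM-based check.
missing_query = """
SELECT
    COUNT(*) - COUNT("TotalCharges") AS null_count,
    SUM(CASE WHEN TRIM("TotalCharges") = '' THEN 1 ELSE 0 END) AS blank_count
FROM telco_customer_churn;
"""
null_count, blank_count = conn.execute(missing_query).fetchone()
print("NULL values:", null_count)
print("Blank strings:", blank_count)
conn.close()
```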


Repeating the above data analysis steps in SQL

In [15]:
import psycopg2
import pandas as pd

# ✅ Step 1: Connect to PostgreSQL
conn = psycopg2.connect("dbname='postgres' user='postgres' password='postgres' host='localhost' port='5432'")
cursor = conn.cursor()

# ✅ Step 2: Statistical Summary of Numerical Columns
statistical_query = """
SELECT 
    column_name, 
    COUNT(*) AS count,
    AVG(column_value::numeric) AS mean,
    STDDEV(column_value::numeric) AS std_dev,
    MIN(column_value::numeric) AS min,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY column_value::numeric) AS Q1,
    PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY column_value::numeric) AS median,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY column_value::numeric) AS Q3,
    MAX(column_value::numeric) AS max
FROM (
    SELECT 'SeniorCitizen' AS column_name, "SeniorCitizen"::TEXT AS column_value FROM telco_customer_churn
    UNION ALL
    SELECT 'tenure', "tenure"::TEXT FROM telco_customer_churn
    UNION ALL
    SELECT 'MonthlyCharges', "MonthlyCharges"::TEXT FROM telco_customer_churn
) AS numeric_data
GROUP BY column_name;
"""

cursor.execute(statistical_query)
stat_summary = cursor.fetchall()
df_stat_summary = pd.DataFrame(stat_summary, columns=['Column', 'Count', 'Mean', 'Std Dev', 'Min', 'Q1', 'Median', 'Q3', 'Max'])

# ✅ Step 3: Check for Duplicate Records
duplicate_query = """
SELECT COUNT(*) AS duplicate_records
FROM (
    SELECT "customerID", COUNT(*)
    FROM telco_customer_churn
    GROUP BY "customerID"
    HAVING COUNT(*) > 1
) AS duplicates;
"""

cursor.execute(duplicate_query)
duplicate_count = cursor.fetchone()[0]

# ✅ Step 4: Check Data Types & Unique Values
data_info_query = """
SELECT 
    column_name, 
    data_type,
    (SELECT COUNT(DISTINCT column_name) FROM information_schema.columns WHERE table_name = 'telco_customer_churn') AS unique_values
FROM information_schema.columns
WHERE table_name = 'telco_customer_churn';
"""

cursor.execute(data_info_query)
data_info = cursor.fetchall()
df_data_info = pd.DataFrame(data_info, columns=['Column', 'Data Type', 'Unique Values'])

# ✅ Step 5: Convert `TotalCharges` to Numeric (Fixing Whitespace Issue)
convert_query = """
UPDATE telco_customer_churn
SET "TotalCharges" = NULL
WHERE TRIM("TotalCharges") = '';

ALTER TABLE telco_customer_churn 
ALTER COLUMN "TotalCharges" TYPE NUMERIC 
USING NULLIF(TRIM("TotalCharges"), '')::NUMERIC;
"""
cursor.execute(convert_query)
conn.commit()


# ✅ Step 6: Close the connection
cursor.close()
conn.close()

# ✅ Step 7: Display Results
print("\n📊 Statistical Summary:")
print(df_stat_summary)

print("\n🛑 Duplicate Records Found:", duplicate_count)

print("\n🧐 Data Types & Unique Values:")
print(df_data_info)


📊 Statistical Summary:
           Column  Count                    Mean                 Std Dev  \
0  MonthlyCharges   7043     64.7616924605991765     30.0900470976784905   
1   SeniorCitizen   7043  0.16214681243788158455  0.36861160561001307794   
2          tenure   7043     32.3711486582422263     24.5594810230944587   

     Min    Q1  Median     Q3     Max  
0  18.25  35.5   70.35  89.85  118.75  
1      0   0.0    0.00   0.00       1  
2      0   9.0   29.00  55.00      72  

🛑 Duplicate Records Found: 0

🧐 Data Types & Unique Values:
              Column         Data Type  Unique Values
0     MonthlyCharges  double precision             21
1      SeniorCitizen            bigint             21
2             tenure            bigint             21
3         Dependents              text             21
4       PhoneService              text             21
5      MultipleLines              text             21
6    InternetService              text             21
7     OnlineSecuri
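
The `Unique Values` column above shows 21 for every row because the subquery counts distinct column names in `information_schema.columns`, which is the number of columns in the table, not the number of distinct values in each column. A minimal sketch of per-column counts follows. It uses an in-memory SQLite database as a stand-in for PostgreSQL and a few hypothetical rows (both assumptions for illustration), and it reads the column names from the cursor description instead of `information_schema`.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE telco_customer_churn ("customerID" TEXT, "gender" TEXT, "tenure" INTEGER)'
)
# Hypothetical rows, not taken from the dataset.
conn.executemany(
    "INSERT INTO telco_customer_churn VALUES (?, ?, ?)",
    [("A-1", "Female", 1), ("A-2", "Male", 34), ("A-3", "Male", 1)],
)

# Column names come from the table itself, so they are interpolated as quoted identifiers.
cursor = conn.execute("SELECT * FROM telco_customer_churn LIMIT 0")
columns = [col[0] for col in cursor.description]

# One COUNT(DISTINCT ...) expression per column, evaluated in a single query.
select_list = ", ".join(f'COUNT(DISTINCT "{name}")' for name in columns)
counts = conn.execute(f"SELECT {select_list} FROM telco_customer_churn").fetchone()

unique_values = dict(zip(columns, counts))
print(unique_values)
conn.close()
```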

In [17]:
import psycopg2
import pandas as pd
from sqlalchemy import create_engine

# ✅ Step 1: Connect to PostgreSQL using SQLAlchemy
engine = create_engine("postgresql://postgres:postgres@localhost:5432/postgres")

# ✅ Step 2: Load Data into a Pandas DataFrame
query = "SELECT * FROM telco_customer_churn;"
df = pd.read_sql(query, engine)

# ✅ Step 3: Inspect the `TotalCharges` column
print("Before Cleaning:")
print(df["TotalCharges"].head(10))

# ✅ Step 4: Convert `TotalCharges` to Numeric
# - Replace spaces (' ') or empty values ('') with NaN
# - Convert the column to numeric (float)
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors='coerce')

# ✅ Step 5: Display cleaned column
print("\nAfter Cleaning:")
print(df["TotalCharges"].head(10))

# ✅ Step 6: Write Cleaned Data Back to PostgreSQL
df.to_sql('telco_customer_churn_cleaned', engine, if_exists='replace', index=False)

print("\n✅ Data Cleaning Completed and Saved in PostgreSQL!")


Before Cleaning:
0      29.85
1    1889.50
2     108.15
3    1840.75
4     151.65
5     820.50
6    1949.40
7     301.90
8    3046.05
9    3487.95
Name: TotalCharges, dtype: float64

After Cleaning:
0      29.85
1    1889.50
2     108.15
3    1840.75
4     151.65
5     820.50
6    1949.40
7     301.90
8    3046.05
9    3487.95
Name: TotalCharges, dtype: float64

✅ Data Cleaning Completed and Saved in PostgreSQL!
