# Data Acquisition & Understanding (with SQL Integration)

###### This step covers identifying, acquiring, and taking a first look at the data. In a real-world setting, data often resides in databases. For this project, we simulate that by loading the CSV into a local SQLite database with Python's sqlite3 module and then querying it.

In [3]:
# Import Libraries:
import pandas as pd
import sqlite3

In [4]:
# Load the CSV into a Pandas DataFrame first.
# The r prefix makes this a raw string, so the backslashes in the Windows path are not read as escape sequences.
csv_path = r'D:\My Data\Family Storages\Rudra\Education\Projects\Project 2 telco custimer churn\Data\Telco-Customer-Churn.csv'
# Replace the path above with the actual location of your downloaded WA_Fn-UseC_-Telco-Customer-Churn.csv
df_raw = pd.read_csv(csv_path)

print("Raw DataFrame Info:")
df_raw.info()
print("\nRaw DataFrame Head:")
print(df_raw.head())

Raw DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043
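
A Windows path written as an ordinary string literal can be changed by backslash escape sequences such as `\n` or `\t`, while a raw string (`r'...'`) keeps every backslash literal. The sketch below illustrates the difference with a hypothetical path (not the notebook's real location) and uses `PureWindowsPath` from the standard library to pull out the file name:

```python
from pathlib import PureWindowsPath

# Hypothetical path, for illustration only.
broken = 'C:\new\table.csv'   # '\n' and '\t' silently become a newline and a tab
raw = r'C:\new\table.csv'     # raw string: backslashes are kept as written

# PureWindowsPath parses Windows-style paths on any operating system.
path_obj = PureWindowsPath(raw)

print(repr(broken))
print(repr(raw))
print(path_obj.name)
```

Forward slashes (`'C:/new/table.csv'`) are another option that `pd.read_csv` generally accepts on Windows.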



In [5]:
# Preview the first five rows of the data.

df_raw.head()

| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |


In [6]:
# Check the number of rows and columns.
df_raw.shape

(7043, 21)

In [7]:
# Get a summary of the DataFrame, including data types and non-null values. This is crucial for identifying potential missing data and incorrect data types.
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [8]:
# Generate descriptive statistics for numerical columns (count, mean, std, min, max, quartiles).

df_raw.describe()

| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.0 | 0.0 | 18.25 |
| 25% | 0.0 | 9.0 | 35.5 |
| 50% | 0.0 | 29.0 | 70.35 |
| 75% | 0.0 | 55.0 | 89.85 |
| max | 1.0 | 72.0 | 118.75 |
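
By default `describe()` summarizes only numeric columns. `TotalCharges` is absent from the summary even though its values look numeric, which suggests pandas read it as text. A minimal sketch of how one might find entries that fail numeric conversion, using a small made-up frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up sample: one entry cannot be parsed as a number.
sample = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "TotalCharges": ["29.85", "unknown", "108.15"],
})

# errors="coerce" turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(sample["TotalCharges"], errors="coerce")

# Rows that held a value originally but failed the conversion.
bad_rows = sample[converted.isna() & sample["TotalCharges"].notna()]
print(bad_rows)
```

Inspecting `bad_rows` before overwriting the column makes it easier to decide whether those entries should be dropped, imputed, or corrected.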


In [9]:
# List all column names.

df_raw.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [10]:
# Count the missing values in each column.

df_raw.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
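
`isnull()` counts only true missing values (`NaN` or `None`). A cell that holds an empty or whitespace-only string is still reported as present. A hedged sketch of one way to count such blanks, on a made-up frame whose columns all hold strings:

```python
import pandas as pd

# Made-up sample: no NaN values, but two blank-looking cells.
sample = pd.DataFrame({
    "Contract": ["One year", "", "Two year"],
    "TotalCharges": ["29.85", " ", "108.15"],
})

null_counts = sample.isnull().sum()  # blanks are not counted here

# Strip whitespace and compare with the empty string, column by column.
# This assumes every column holds strings.
blank_counts = sample.apply(lambda col: col.str.strip().eq("").sum())

print(null_counts)
print(blank_counts)
```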

In [11]:
# Check for duplicate rows.

df_raw.duplicated().sum()

0
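
`duplicated()` with no arguments flags only rows that match on every column. Since `customerID` looks like the natural key, it may also be worth checking that column on its own. The sketch below assumes the ID is meant to be unique, which the source does not state, and uses made-up data:

```python
import pandas as pd

# Made-up sample: the two "A-1" rows differ in tenure,
# so they are not full-row duplicates.
sample = pd.DataFrame({
    "customerID": ["A-1", "A-1", "B-2"],
    "tenure": [1, 2, 34],
})

full_row_dupes = sample.duplicated().sum()
key_dupes = sample.duplicated(subset=["customerID"]).sum()

# keep=False marks every row that shares a key, not just the later ones.
repeated = sample[sample.duplicated(subset=["customerID"], keep=False)]

print(full_row_dupes, key_dupes)
print(repeated)
```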

#### Create a SQLite database and load the DataFrame into a table:

In [13]:
# Connect to (or create) a SQLite database
conn = sqlite3.connect('Telco-Customer-Churn.db')

# Write the raw DataFrame to a SQL table
# if_exists='replace' will overwrite the table if it already exists
# index=False prevents Pandas from writing the DataFrame index as a column
df_raw.to_sql('telco_churn', conn, if_exists='replace', index=False)

print("\nSuccessfully loaded CSV into SQLite database 'Telco-Customer-Churn.db' as table 'telco_churn'.")


Successfully loaded CSV into SQLite database 'Telco-Customer-Churn.db' as table 'telco_churn'.
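
After writing the table, it can be useful to confirm that the row count and column names in SQLite match the DataFrame. A minimal sketch using an in-memory database and a made-up frame (the table name `demo` is hypothetical):

```python
import sqlite3
from contextlib import closing

import pandas as pd

sample = pd.DataFrame({
    "customerID": ["A-1", "B-2"],
    "MonthlyCharges": [29.85, 56.95],
})

# ":memory:" keeps the database in RAM; closing() releases the connection afterwards.
with closing(sqlite3.connect(":memory:")) as conn:
    sample.to_sql("demo", conn, if_exists="replace", index=False)

    row_count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
    # PRAGMA table_info yields one row per column: (cid, name, type, notnull, dflt_value, pk).
    columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(demo)")]

print(row_count)
print(columns)
```

The same two checks could be run against the on-disk database once the notebook's connection is open.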


#### Initial Data Extraction & Inspection using SQL (from the simulated DB):

In [15]:
sql_query_select_all = "SELECT * FROM telco_churn;"
df_from_sql = pd.read_sql(sql_query_select_all, conn)

print("\nDataFrame loaded from SQL query:")
print(df_from_sql.head())
# info() prints its summary directly and returns None, so it is not wrapped in print().
df_from_sql.info()


DataFrame loaded from SQL query:
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   
3  7795-CFOCW    Male              0      No         No      45           No   
4  9237-HQITU  Female              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   
3  No phone service             DSL            Yes  ...              Yes   
4                No     Fiber optic             No  ...               No   

  TechSupport StreamingTV St
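
`SELECT *` pulls every column into memory. When only a subset is needed, the selection and filtering can be pushed into SQL, and `pd.read_sql` accepts a `params` argument so that values are bound rather than pasted into the query string. A sketch against a made-up in-memory table (the table name and values are illustrative):

```python
import sqlite3
from contextlib import closing

import pandas as pd

sample = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "Contract": ["Month-to-month", "One year", "Month-to-month"],
    "Churn": ["No", "No", "Yes"],
})

with closing(sqlite3.connect(":memory:")) as conn:
    sample.to_sql("demo", conn, index=False)

    # sqlite3 uses "?" as its placeholder style.
    query = (
        "SELECT customerID, Churn FROM demo "
        "WHERE Contract = ? ORDER BY customerID;"
    )
    subset = pd.read_sql(query, conn, params=("Month-to-month",))

print(subset)
```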

#### Perform an initial SQL-based data quality check:

##### Check for missing values (e.g., in TotalCharges) directly in SQL if you prefer, before pulling into Pandas.
##### Get counts of unique values for categorical columns.

In [18]:
# Example SQL for an initial data quality check (count rows where TotalCharges is NULL or an empty string)
sql_check_total_charges = "SELECT COUNT(*) FROM telco_churn WHERE TotalCharges IS NULL OR TotalCharges = '';"
missing_total_charges = pd.read_sql(sql_check_total_charges, conn).iloc[0,0]
print(f"\nMissing or empty TotalCharges rows (from SQL): {missing_total_charges}")

# Example SQL to check unique values for a categorical column
sql_check_contract = "SELECT DISTINCT Contract FROM telco_churn;"
unique_contracts = pd.read_sql(sql_check_contract, conn)
print("\nUnique Contract types (from SQL):")
print(unique_contracts)


Missing or empty TotalCharges rows (from SQL): 0

Unique Contract types (from SQL):
         Contract
0  Month-to-month
1        One year
2        Two year


##### Outcomes:
###### Successfully acquired the "Telco Customer Churn" dataset.
###### Demonstrated the real-world process of loading data into a database (simulated with SQLite) and querying it using SQL.
###### Created an initial Pandas DataFrame, df_from_sql, from the SQL query and gained a preliminary understanding of its structure, data types, and data quality.
###### Identified the target variable (Churn) and the potential features.