I'm working with the Chicago Crime dataset, which comes as a single 1.8 GB CSV file. The file represents one table, and I want to import it into a MySQL DB via the 'Import Data Tool'. However, the import currently fails because the formatting of the data is too inconsistent, so some cleaning and normalization is needed before the import can succeed.

In [2]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [3]:
import csv
import pandas as pd
import datetime

To set up for later steps, I'm going to pull out all of the column names present in the Chicago DB Crimes .csv file. 

In [4]:
def get_column_names(input_file):
    """
    Extracts column names from the header of a CSV file.

    Parameters:
    - input_file (str): Path to the CSV file.

    Returns:
    - list: A list of column names.
    """
    with open(input_file, mode='r', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        header = next(reader, None)  # Read the first row as the header
        if header:
            return [col.strip() for col in header]  # Remove any extra spaces
        else:
            raise ValueError("The CSV file does not contain a header row.")

# Replace with your file path
input_csv = 'Crimes_-_2001_to_Present.csv'

try:
    column_names = get_column_names(input_csv)
    print("Column names in the CSV file:")
    print(column_names)
except Exception as e:
    print(f"Error: {e}")

Column names in the CSV file:
['ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type', 'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location']


When attempting to import the data via the Import Data Tool in DBeaver (an open-source universal DB tool), I found that some rows had boolean values that weren't consistent, so here I'm trying to find every 'bad' value so I know what I need to find and replace.

In [5]:

def find_unique_values(input_file, target_column):
    """
    Reads a CSV file and collects all unique values in the specified column.

    Parameters:
    - input_file (str): Path to the CSV file.
    - target_column (str): Name of the column to analyze.

    Returns:
    - set: A set of unique values found in the specified column.
    """
    unique_values = set()

    with open(input_file, mode='r', encoding='utf-8') as infile:
        reader = csv.DictReader(infile)
        
        if target_column not in reader.fieldnames:
            raise ValueError(f"Column '{target_column}' not found in the CSV file.")
        
        for row in reader:
            value = row.get(target_column, None)
            if value is not None:
                unique_values.add(value.strip())

    return unique_values

# Replace with your file path and column name
input_csv = 'Crimes_-_2001_to_Present.csv'
column_names = ['Arrest', 'Domestic', 'Beat', 'District', 'Ward', 'Community Area']

# try:
#     for i in column_names:
#         unique_values = find_unique_values(input_csv, i)
#         print(f"Unique values found in column '{i}':")
#         for value in unique_values:
#             print(value)
# except Exception as e:
#     print(f"Error: {e}")
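The loop above calls `find_unique_values` once per column, which re-reads the whole file each time. As a possible alternative, here is a minimal sketch that gathers the values for several columns in one pass and also counts how often each appears, which helps tell a rare bad value from a common one. The function name `count_values` and the sample rows are made up for illustration; they are not part of the original notebook.

```python
import csv
import io
from collections import Counter


def count_values(lines, target_columns):
    """Count every distinct (stripped) value per column in one pass over the rows."""
    reader = csv.DictReader(lines)
    missing = [c for c in target_columns if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"Columns not found in the CSV: {missing}")

    counts = {column: Counter() for column in target_columns}
    for row in reader:
        for column in target_columns:
            value = row.get(column)
            if value is not None:
                counts[column][value.strip()] += 1
    return counts


# Tiny in-memory sample standing in for the real file (made-up rows).
sample = io.StringIO(
    "ID,Arrest,Domestic\n"
    "1,true,false\n"
    "2,TRUE,false\n"
    "3,,true\n"
)
result = count_values(sample, ["Arrest", "Domestic"])
for column, counter in result.items():
    print(column, dict(counter))
```

With the real data, an open file handle can be passed in place of the sample, for example `open(input_csv, encoding='utf-8', newline='')` inside a `with` block.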

Import some constants to help navigate the DB.

I'm also setting up the MySQL and SQLAlchemy connections to be used later.

This is the query I executed in DBeaver to create the table:
```SQL
USE chicagocrime; 

CREATE TABLE IF NOT EXISTS crime_records (
    id INT PRIMARY KEY,
    case_number VARCHAR(50),
    date DATETIME,
    block TEXT,
    iucr VARCHAR(10),
    primary_type TEXT,
    description TEXT,
    location_description TEXT,
    arrest BOOLEAN,
    domestic BOOLEAN,
    beat VARCHAR(10),
    district VARCHAR(10),
    ward INT,
    community_area VARCHAR(50),
    fbi_code VARCHAR(10),
    x_coordinate DOUBLE,
    y_coordinate DOUBLE,
    year INT,
    updated_on DATETIME,
    latitude DOUBLE,
    longitude DOUBLE,
    location TEXT
);
```
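The column names in this table are the CSV headers lower-cased, with spaces replaced by underscores. The mapping the notebook actually uses comes from the `CC_DB_Constants` module, which is not shown here, so the following is only a sketch of how an equivalent header-to-column mapping could be derived. The helper `to_db_column` and the variable `column_name_map` are hypothetical names.

```python
def to_db_column(csv_header):
    """Turn a CSV header such as 'Case Number' into 'case_number'."""
    return csv_header.strip().lower().replace(" ", "_")


# Headers as printed earlier in the notebook.
csv_headers = [
    'ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Primary Type',
    'Description', 'Location Description', 'Arrest', 'Domestic', 'Beat',
    'District', 'Ward', 'Community Area', 'FBI Code', 'X Coordinate',
    'Y Coordinate', 'Year', 'Updated On', 'Latitude', 'Longitude', 'Location',
]

# Assumed shape: {csv header: database column name}.
column_name_map = {header: to_db_column(header) for header in csv_headers}

for header, column in column_name_map.items():
    print(f"{header!r} -> {column!r}")
```

A dictionary of this shape can be passed straight to `DataFrame.rename(columns=...)`.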

In [6]:
from CC_DB_Constants import CC_DB_TABLE_DEFINTIONS, DB_USR_CONFIG, TABLE_NAMES, DB_COLUMN_NAME_MAP

import mysql.connector
import mysql.connector.errorcode
from sqlalchemy import create_engine
import pymysql

con_string = f"mysql+pymysql://{DB_USR_CONFIG['user']}:{DB_USR_CONFIG['password']}@{DB_USR_CONFIG['host']}/{DB_USR_CONFIG['database']}"
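One caveat with building the connection URL through plain string formatting: if the user name or password contains characters such as `@`, `:` or `/`, the URL can no longer be parsed correctly. A minimal sketch of escaping those parts with the standard library follows. The `example_config` values are invented, and it only assumes that `DB_USR_CONFIG` has the same four keys used above.

```python
from urllib.parse import quote_plus

# Hypothetical config with the same keys the notebook reads from DB_USR_CONFIG.
example_config = {
    "user": "crime_user",
    "password": "p@ss:word/123",
    "host": "localhost",
    "database": "chicagocrime",
}


def build_connection_string(config):
    """Build a mysql+pymysql URL with the credentials URL-encoded."""
    user = quote_plus(config["user"])
    password = quote_plus(config["password"])
    return f"mysql+pymysql://{user}:{password}@{config['host']}/{config['database']}"


safe_con_string = build_connection_string(example_config)
print(safe_con_string)
# mysql+pymysql://crime_user:p%40ss%3Aword%2F123@localhost/chicagocrime
```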

Load the crime records into a pandas DataFrame, perform some data cleaning and format the data to match what MySQL expects, then insert all of the data into the MySQL DB.
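To make the two main conversions concrete, here is a small standalone sketch that applies them to single values: the `MM/DD/YYYY HH:MM:SS AM/PM` timestamps become `YYYY-MM-DD HH:MM:SS`, and the textual booleans become `1`/`0`. The function names and sample values are illustrative only, and returning `None` for unrecognized booleans is an assumption about how blanks should be stored.

```python
import datetime

TRUE_VALUES = {"true", "1", "yes"}
FALSE_VALUES = {"false", "0", "no"}


def to_mysql_datetime(date_str):
    """Convert 'MM/DD/YYYY HH:MM:SS AM/PM' to 'YYYY-MM-DD HH:MM:SS'."""
    parsed = datetime.datetime.strptime(date_str, "%m/%d/%Y %I:%M:%S %p")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


def to_mysql_bool(value):
    """Map textual booleans to 1/0; anything unrecognized becomes None (NULL)."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return 1
    if text in FALSE_VALUES:
        return 0
    return None


print(to_mysql_datetime("09/05/2015 01:30:00 PM"))  # 2015-09-05 13:30:00
print(to_mysql_bool("TRUE"), to_mysql_bool("false"), to_mysql_bool(""))  # 1 0 None
```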

In [21]:
def convert_to_iso_format(date_str):
    """
    Converts a date string in the format 'MM/DD/YYYY HH:MM:SS AM/PM'
    into ISO 8601 format 'YYYY-MM-DD HH:MM:SS'.
    """
    # Define the input format
    input_format = '%m/%d/%Y %I:%M:%S %p'
    # Define the desired output format
    output_format = '%Y-%m-%d %H:%M:%S'
    
    try:
        # Parse the input string into a datetime object
        datetime_obj = datetime.datetime.strptime(date_str, input_format)
        # Format the datetime object into the ISO format
        iso_formatted_date = datetime_obj.strftime(output_format)
        return iso_formatted_date
    except ValueError as e:
        print(f"Error converting date: {e}")
        return None


def cast_value(value, expected_type):
    """
    Attempts to cast a value to the expected type.
    Returns the casted value if successful, or None if casting fails.
    """
    try:
        if expected_type == bool:
            # Special handling for boolean values
            if str(value).lower() in ("true", "1", "yes"):
                return '1' # equivalent to 'true'
            elif str(value).lower() in ("false", "0", "no"):
                return '0' # equivalent to 'false'
            else:
                raise ValueError(f"Cannot cast {value} to bool.")
        elif expected_type == datetime.datetime:
            # Attempt to parse datetime
            return convert_to_iso_format(value)
        elif expected_type == int and value == "":
            # Allow empty strings to be cast to None for integers
            return None
        elif expected_type == float and value == "":
            # Allow empty strings to be cast to None for floats
            return None
        elif value in ("", "N/A", "NULL"):
            return None
        else:
            # General casting
            return expected_type(value)
    except (ValueError, TypeError):
        return None

# Main function to process and insert data
def clean_and_insert_pandas(input_file, table_name):
    # Load the CSV into a Pandas DataFrame
    df = pd.read_csv(input_file)

    df.rename(columns=DB_COLUMN_NAME_MAP[table_name], inplace=True)
    print(f"Columns renamed to: {df.columns}")
    
    # Apply data cleaning and transformations
    for column, config in CC_DB_TABLE_DEFINTIONS[table_name].items():
        if column in df.columns:
            expected_type = config['python_type']
            if expected_type == bool:
                # Map truthy strings to True; everything else (including blanks) becomes False.
                # A second pass testing for "false" values would invert the result, so one pass is used.
                df[column] = df[column].apply(lambda x: str(x).lower() in ("true", "1", "yes"))
            elif expected_type == datetime.datetime:
                df[column] = df[column].apply(lambda x: convert_to_iso_format(x) if pd.notna(x) else None)
            elif expected_type in [int, float]:
                df[column] = pd.to_numeric(df[column], errors='coerce')
            else:
                #df[column] = df[column].fillna(None)
                print(f"Unhandled expected type encountered: {expected_type}")

    # Create SQLAlchemy engine
    engine = create_engine(con_string)

    # Insert data into the database
    try:
        df.to_sql(TABLE_NAMES['crime'], con=engine, if_exists='append', index=False, chunksize=10000)
        print("Data inserted successfully!")
    except Exception as e:
        print(f"Error inserting data: {e}")

# Replace these with your file path and table name
input_csv = 'Crimes_-_2001_to_Present.csv'
table_name = 'crime_records'

clean_and_insert_pandas(input_csv, table_name)

Columns renamed to: Index(['id', 'case_number', 'date', 'block', 'iucr', 'primary_type',
       'description', 'location_description', 'arrest', 'domestic', 'beat',
       'district', 'ward', 'community_area', 'fbi_code', 'x_coordinate',
       'y_coordinate', 'year', 'updated_on', 'latitude', 'longitude',
       'location'],
      dtype='object')
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Unhandled expected type encountered: <class 'str'>
Data inserted successfully!
