Step 1: Load the Data and Handle Missing Values
Load each CSV file and display the first few rows of each dataset.

File Paths: A list of file paths for each CSV file is created.

Function Definition: A function load_and_display_csv is defined to load a CSV file, fill missing values with an empty string, and print the first few rows.

Loading and Displaying: The function is called for each file path in the list, and the resulting DataFrames are stored in the dataframes list.

In [19]:
import pandas as pd

# Define the file paths
file_paths = [
    'MPS Borough Level Crime (Historical).csv',
    'MPS Borough Level Crime (most recent 24 months).csv',
    'MPS LSOA Level Crime (Historical).csv',
    'MPS LSOA Level Crime (most recent 24 months).csv',
    'MPS Ward Level Crime (Historical).csv',
    'MPS Ward Level Crime (most recent 24 months).csv',
    'reported.csv',
    'SYB63_328_202009_Intentional Homicides and Other Crimes.csv'
]

# Function to load and display the first few rows of each dataset
def load_and_display_csv(file_path):
    df = pd.read_csv(file_path)
    df = df.fillna('')  # Fill missing values with an empty string
    print(f"Displaying the first few rows of the dataset: {file_path}")
    print(df.head())
    return df

# Load and display each dataset
dataframes = [load_and_display_csv(path) for path in file_paths]


Displaying the first few rows of the dataset: MPS Borough Level Crime (Historical).csv
                   MajorText                        MinorText  \
0  ARSON AND CRIMINAL DAMAGE                            ARSON   
1  ARSON AND CRIMINAL DAMAGE                  CRIMINAL DAMAGE   
2                   BURGLARY  BURGLARY BUSINESS AND COMMUNITY   
3                   BURGLARY           BURGLARY IN A DWELLING   
4              DRUG OFFENCES              POSSESSION OF DRUGS   

            BoroughName  201004  201005  201006  201007  201008  201009  \
0  Barking and Dagenham       6       5      11      10       6       6   
1  Barking and Dagenham     204     190     218     217     203     161   
2  Barking and Dagenham      48      58      58      46      46      51   
3  Barking and Dagenham     116     102     124     137     153     136   
4  Barking and Dagenham      76      64      82      72      98      87   

   201010  ...  202107  202108  202109  202110  202111  202112  202201 

In [1]:
import pandas as pd

# Define the file paths
file_paths = [
    'MPS Borough Level Crime (Historical).csv',
    'MPS Borough Level Crime (most recent 24 months).csv',
    'MPS LSOA Level Crime (Historical).csv',
    'MPS LSOA Level Crime (most recent 24 months).csv',
    'MPS Ward Level Crime (Historical).csv',
    'MPS Ward Level Crime (most recent 24 months).csv',
    'reported.csv',
    'SYB63_328_202009_Intentional Homicides and Other Crimes.csv'
]

# Function to load and display the first few rows and basic info of each dataset
def inspect_csv(file_path):
    df = pd.read_csv(file_path)
    print(f"Displaying the first few rows of the dataset: {file_path}")
    print(df.head())
    print(f"\nBasic information of the dataset: {file_path}")
    df.info()  # info() prints its summary directly and returns None
    print(f"\nMissing values in the dataset: {file_path}")
    print(df.isnull().sum())
    return df

# Load and inspect each dataset
dataframes = {path: inspect_csv(path) for path in file_paths}




Displaying the first few rows of the dataset: MPS Borough Level Crime (Historical).csv
                   MajorText                        MinorText  \
0  ARSON AND CRIMINAL DAMAGE                            ARSON   
1  ARSON AND CRIMINAL DAMAGE                  CRIMINAL DAMAGE   
2                   BURGLARY  BURGLARY BUSINESS AND COMMUNITY   
3                   BURGLARY           BURGLARY IN A DWELLING   
4              DRUG OFFENCES              POSSESSION OF DRUGS   

            BoroughName  201004  201005  201006  201007  201008  201009  \
0  Barking and Dagenham       6       5      11      10       6       6   
1  Barking and Dagenham     204     190     218     217     203     161   
2  Barking and Dagenham      48      58      58      46      46      51   
3  Barking and Dagenham     116     102     124     137     153     136   
4  Barking and Dagenham      76      64      82      72      98      87   

   201010  ...  202107  202108  202109  202110  202111  202112  202201 
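One caveat about the missing-value step: load_and_display_csv calls fillna('') on every column, and where a numeric column has gaps this can leave it holding a mix of numbers and empty strings. The sketch below is one possible alternative that fills only the non-numeric columns. The helper name fill_missing_text and the sample data are hypothetical and are not taken from the datasets above.

```python
import pandas as pd

def fill_missing_text(df):
    """Fill gaps in non-numeric columns only; numeric columns are left as-is.

    Hypothetical helper, shown as an alternative to df.fillna('').
    """
    df = df.copy()
    # Every column that is not numeric is treated as text here (an assumption)
    text_cols = df.select_dtypes(exclude='number').columns
    df[text_cols] = df[text_cols].fillna('')
    return df

# Made-up sample with one gap in a text column and one in a count column
sample = pd.DataFrame({
    'MinorText': ['ARSON', None],
    '201004': [6.0, None],
})
filled = fill_missing_text(sample)
print(filled)
print(filled.dtypes)
```

With this approach the gap in the count column stays as NaN, so sums and means on the monthly columns still work, while the text column is safe to concatenate.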

Step 2: Define Text Cleaning Functions and Handle Non-Text Values

Explanation

File Paths: List of file paths for each CSV file is provided.

Load and Display CSV Files: A function is defined to load each CSV file, fill missing values with an empty string, and print the first few rows. The DataFrames are stored in a list named dataframes.

Download NLTK Resources: Ensure the stopwords and tokenizer resources are available.

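Define clean_text Function: This function checks whether the input is a string before processing it. Strings are lowercased, tokenized, and stripped of non-alphanumeric tokens and English stopwords; any other value is returned unchanged.

Define handle_missing_and_clean_text Function: This function fills missing values with an empty string, builds a description column (MajorText plus MinorText when both exist, otherwise the first column unless a description column is already present), and stores the cleaned result in clean_text.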

Define load_and_clean_csv Function: This function loads a CSV file, applies the cleaning function, and prints the first few rows.

Load, Clean, and Display CSV Files: The function is called for each file path in the list, and the resulting DataFrames are stored in the dataframes list.

Run the code to load, clean, and display the datasets. Each dataset is printed with its name and the first few rows of the description and clean_text columns.

Because clean_text only processes strings, non-text values pass through unchanged.
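To see the effect of the string check in isolation, here is a minimal sketch that runs without NLTK. It is not the notebook's function: it swaps word_tokenize for a simple regular expression and uses a small hand-picked stopword set, so results can differ from the NLTK version on punctuation-heavy text. The name clean_text_sketch is hypothetical.

```python
import re

# Small hand-picked stopword set, for illustration only; the notebook
# uses NLTK's full English stopword list instead.
STOP_WORDS = {'and', 'in', 'a', 'of', 'the'}

def clean_text_sketch(text):
    """Lowercase, keep alphanumeric tokens, and drop stopwords.

    Non-string input is returned unchanged, mirroring the isinstance check.
    """
    if not isinstance(text, str):
        return text
    # Regex tokenization stands in for nltk.word_tokenize here
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    return ' '.join(t for t in tokens if t not in STOP_WORDS)

print(clean_text_sketch('BURGLARY BURGLARY IN A DWELLING'))  # burglary burglary dwelling
print(clean_text_sketch(204))  # 204
```

The second call shows why the guard matters: a numeric cell reaches the function as an int, and calling .lower() on it would raise an AttributeError.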

In [20]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopword list and tokenizer model if not already present
nltk.download('stopwords')
nltk.download('punkt')

# Build the stopword set once; calling stopwords.words() per token is slow
STOP_WORDS = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    if isinstance(text, str):
        tokens = word_tokenize(text.lower())
        tokens = [word for word in tokens if word.isalnum()]
        tokens = [word for word in tokens if word not in STOP_WORDS]
        return ' '.join(tokens)
    return text  # Non-string values are returned unchanged

# Function to handle missing values and clean text data
def handle_missing_and_clean_text(df):
    # Fill missing values in every column with an empty string
    df = df.fillna('')
    
    # Combine relevant text columns if needed
    if 'MajorText' in df.columns and 'MinorText' in df.columns:
        df['description'] = df['MajorText'] + " " + df['MinorText']
    elif 'description' not in df.columns:
        df['description'] = df[df.columns[0]]
    
    # Apply the cleaning function to the description column
    df['clean_text'] = df['description'].apply(clean_text)
    
    return df

# Function to load, clean, and display the first few rows of each dataset
def load_and_clean_csv(file_path):
    df = pd.read_csv(file_path)
    df = handle_missing_and_clean_text(df)
    print(f"Displaying the first few rows of the cleaned dataset: {file_path}")
    print(df[['description', 'clean_text']].head())
    return df

# Load, clean, and display each dataset
dataframes = [load_and_clean_csv(path) for path in file_paths]





[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jeffo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jeffo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Displaying the first few rows of the cleaned dataset: MPS Borough Level Crime (Historical).csv
                                 description  \
0            ARSON AND CRIMINAL DAMAGE ARSON   
1  ARSON AND CRIMINAL DAMAGE CRIMINAL DAMAGE   
2   BURGLARY BURGLARY BUSINESS AND COMMUNITY   
3            BURGLARY BURGLARY IN A DWELLING   
4          DRUG OFFENCES POSSESSION OF DRUGS   

                              clean_text  
0            arson criminal damage arson  
1  arson criminal damage criminal damage  
2   burglary burglary business community  
3             burglary burglary dwelling  
4         drug offences possession drugs  
Displaying the first few rows of the cleaned dataset: MPS Borough Level Crime (most recent 24 months).csv
                                 description  \
0            ARSON AND CRIMINAL DAMAGE ARSON   
1  ARSON AND CRIMINAL DAMAGE CRIMINAL DAMAGE   
2            BURGLARY BURGLARY - RESIDENTIAL   
3   BURGLARY BURGLARY BUSINESS AND COMMUNITY   
4           

Step 2: Handle Missing Values and Clean Text Data

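We'll write a function to handle missing values and clean the text data for each DataFrame. This version takes the text columns to combine as an argument, because the column names differ between files.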

In [2]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stopword list and tokenizer model if not already present
nltk.download('stopwords')
nltk.download('punkt')

# Build the stopword set once; calling stopwords.words() per token is slow
STOP_WORDS = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    if isinstance(text, str):
        tokens = word_tokenize(text.lower())
        tokens = [word for word in tokens if word.isalnum()]
        tokens = [word for word in tokens if word not in STOP_WORDS]
        return ' '.join(tokens)
    return text  # Non-string values are returned unchanged

# Function to handle missing values and clean text data
def handle_missing_and_clean_text(df, text_columns):
    # Fill missing values in every column with an empty string
    df = df.fillna('')
    
    # Combine relevant text columns into 'description'
    df['description'] = df[text_columns].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
    
    # Apply the cleaning function to the description column
    df['clean_text'] = df['description'].apply(clean_text)
    
    return df

# Function to load, clean, and display the first few rows of each dataset
def load_and_clean_csv(file_path, text_columns):
    df = pd.read_csv(file_path)
    df = handle_missing_and_clean_text(df, text_columns)
    print(f"\nDisplaying the first few rows of the cleaned dataset: {file_path}")
    print(df[text_columns + ['description', 'clean_text']].head())
    return df

# Define the file paths and respective text columns to be combined
datasets_info = [
    ('MPS Borough Level Crime (Historical).csv', ['MajorText', 'MinorText']),
    ('MPS Borough Level Crime (most recent 24 months).csv', ['MajorText', 'MinorText']),
    ('MPS LSOA Level Crime (Historical).csv', ['Major Category', 'Minor Category']),
    ('MPS LSOA Level Crime (most recent 24 months).csv', ['Major Category', 'Minor Category']),
    ('MPS Ward Level Crime (Historical).csv', ['MajorText', 'MinorText']),
    ('MPS Ward Level Crime (most recent 24 months).csv', ['MajorText', 'MinorText']),
    ('reported.csv', ['Year', 'crimes.total', 'crimes.penal.code', 'crimes.person', 'murder', 'assault', 'sexual.offenses', 'rape', 'stealing.general', 'burglary']),
    ('SYB63_328_202009_Intentional Homicides and Other Crimes.csv', ['T11', 'Intentional homicides and other crimes', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 6'])
]

# Load, clean, and display each dataset
dataframes = [load_and_clean_csv(file_path, text_columns) for file_path, text_columns in datasets_info]


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jeffo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jeffo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!



Displaying the first few rows of the cleaned dataset: MPS Borough Level Crime (Historical).csv
                   MajorText                        MinorText  \
0  ARSON AND CRIMINAL DAMAGE                            ARSON   
1  ARSON AND CRIMINAL DAMAGE                  CRIMINAL DAMAGE   
2                   BURGLARY  BURGLARY BUSINESS AND COMMUNITY   
3                   BURGLARY           BURGLARY IN A DWELLING   
4              DRUG OFFENCES              POSSESSION OF DRUGS   

                                 description  \
0            ARSON AND CRIMINAL DAMAGE ARSON   
1  ARSON AND CRIMINAL DAMAGE CRIMINAL DAMAGE   
2   BURGLARY BURGLARY BUSINESS AND COMMUNITY   
3            BURGLARY BURGLARY IN A DWELLING   
4          DRUG OFFENCES POSSESSION OF DRUGS   

                              clean_text  
0            arson criminal damage arson  
1  arson criminal damage criminal damage  
2   burglary burglary business community  
3             burglary burglary dwelling  
4        
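Since the column names differ between files (MajorText and MinorText in some, Major Category and Minor Category in others), a mistyped name in datasets_info only shows up as a KeyError partway through loading. The sketch below shows one way to check the names up front and combine the columns. The helper combine_text_columns and the sample rows are hypothetical and are not part of the notebook.

```python
import pandas as pd

def combine_text_columns(df, text_columns):
    """Join the given columns into one space-separated string per row.

    Hypothetical helper; raises KeyError listing any column that is absent.
    """
    missing = [col for col in text_columns if col not in df.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    # Fill gaps and cast to str so numeric columns can be joined as well
    parts = df[text_columns].fillna('').astype(str)
    return parts.agg(' '.join, axis=1)

# Made-up sample rows using the borough-level column names
sample = pd.DataFrame({
    'MajorText': ['BURGLARY', 'DRUG OFFENCES'],
    'MinorText': ['BURGLARY IN A DWELLING', None],
})
combined = combine_text_columns(sample, ['MajorText', 'MinorText'])
print(combined.tolist())
```

Passing a name that does not exist in the frame, such as 'Major Category' for this sample, fails immediately with a message naming the missing column.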
