1. Importing Libraries

Explanation:
  pandas: A powerful data manipulation library in Python, used for reading and processing data.
  StandardScaler from sklearn.preprocessing: A tool for standardizing features by removing the mean and scaling to unit variance.

Purpose:
These libraries are essential for data manipulation and normalization, which are crucial steps in data preprocessing for machine learning models.

In [1]:
# Import necessary libraries
import pandas as pd
from sklearn.preprocessing import StandardScaler



2. Loading the Data

Explanation:
  pd.read_csv(file_path): Reads the CSV file from the specified path into a pandas DataFrame.

Purpose:
Loading the data into a DataFrame allows you to manipulate and analyze it easily using pandas functions.
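
As a minimal sketch of what pd.read_csv does, the example below parses a small in-memory CSV instead of a file on disk. The two rows and the three column names are a hypothetical stand-in shaped like the crime data, not the contents of reported.csv. An empty field is read as NaN, which is how missing values end up in the DataFrame.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file: the narcotics value for 1950 is left empty.
csv_text = "Year,crimes.total,narcotics\n1950,2784,\n1951,3284,0\n"

# read_csv accepts a path or any file-like object, so StringIO works for a quick test.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # a column containing NaN is stored as float
print(sample)
```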

In [4]:
# Load the CSV file
file_path = 'reported.csv'  
data = pd.read_csv(file_path)

3. Displaying the First Few Rows, Basic Information, and Summary Statistics


Displaying the First Few Rows:

Explanation:
data.head(): Displays the first five rows of the DataFrame.

Purpose:
Provides a quick overview of the dataset's structure and the initial few records, helping to understand what kind of data you are working with.

Displaying Basic Information:

Explanation:
data.info(): Provides a summary of the DataFrame, including the number of non-null entries, column data types, and memory usage.

Purpose:
Helps identify the presence of missing values and understand the types of data in each column, which is important for subsequent data cleaning steps.

Displaying Summary Statistics:

Explanation:
data.describe(): Generates descriptive statistics for the numeric columns in the DataFrame: count, mean, standard deviation, min, max, and the 25th, 50th (median), and 75th percentiles.

Purpose:
Provides insights into the distribution and central tendency of the data, helping to detect any anomalies or outliers.
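
To illustrate how summary statistics can hint at outliers, here is a small sketch on a made-up column (the values are invented for illustration and are not taken from the dataset). The median appears in the "50%" row of the describe() output, and a maximum far above the 75th percentile suggests an unusual value worth inspecting.

```python
import pandas as pd

# Invented values: four small counts and one much larger one.
toy = pd.DataFrame({"murder": [1, 1, 2, 3, 13]})

summary = toy.describe()
print(summary)

# describe() reports the median as the 50th percentile.
median_from_describe = summary.loc["50%", "murder"]
mean_from_describe = summary.loc["mean", "murder"]

# A mean well above the median is one sign that large values are pulling it up.
print("median:", median_from_describe, "mean:", mean_from_describe)
```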


In [5]:
# Display the first few rows of the dataframe
print("First few rows of the data:")
print(data.head())

# Display basic information about the data
print("\nBasic information about the data:")
data.info()  # info() prints its summary directly and returns None, so wrapping it in print() adds a stray "None"

# Display summary statistics
print("\nSummary statistics of the data:")
print(data.describe())

First few rows of the data:
   Year  crimes.total  crimes.penal.code  crimes.person  murder  assault  \
0  1950          2784               2306            120       1      105   
1  1951          3284               2754            125       1      109   
2  1952          3160               2608            119       1      104   
3  1953          2909               2689            119       1      105   
4  1954          3028               2791            126       1      107   

   sexual.offenses  rape  stealing.general  burglary  ...  vehicle.theft  \
0               40     5              1578       295  ...            NaN   
1               45     6              1899       342  ...            NaN   
2               39     4              1846       372  ...            NaN   
3               45     5              1929       361  ...            NaN   
4               41     5              1981       393  ...            NaN   

   out.of.vehicle.theft  shop.theft  robbery  fraud  crimi

4. Checking for Missing Values

Explanation:
  data.isnull().sum(): Counts the number of missing values in each column.

Purpose:
Identifies columns with missing data, which is essential for data cleaning.
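
The counting step can be sketched on a toy DataFrame as follows. The values are invented; only the column names echo the dataset. isnull() marks each cell as True or False, and sum() adds up the True values per column. Filtering the result is an optional extra that keeps only the columns that need attention.

```python
import numpy as np
import pandas as pd

# Invented values with two gaps in one column.
toy = pd.DataFrame({
    "Year": [1950, 1951, 1952],
    "vehicle.theft": [np.nan, np.nan, 10.0],
    "robbery": [2, 3, 4],
})

# True counts as 1 when summed, so this yields the number of missing cells per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Optional: show only the columns that actually contain missing values.
columns_with_gaps = missing_counts[missing_counts > 0]
print(columns_with_gaps)
```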

In [6]:
# Check for missing values
print("\nMissing values in the data:")
print(data.isnull().sum())



Missing values in the data:
Year                     0
crimes.total             0
crimes.penal.code        0
crimes.person            0
murder                   0
assault                  0
sexual.offenses          0
rape                     0
stealing.general         0
burglary                 0
house.theft             15
vehicle.theft            7
out.of.vehicle.theft    15
shop.theft              15
robbery                  0
fraud                    0
criminal.damage          0
other.penal.crimes       0
narcotics                4
drunk.driving            0
population               0
dtype: int64


5. Handling Missing Values

Explanation:
  data[column].fillna(data[column].median()): Returns a copy of the column with its missing values replaced by that column's median; assigning the result back to data[column] updates the DataFrame.

Purpose:
Filling missing values with the median keeps every row available for analysis. Because the median is robust to outliers, the imputed values are not pulled toward extreme observations the way a mean would be.

In [None]:
# Handling missing values (example: filling with median)
# Assign the result back instead of calling fillna(..., inplace=True) on data[column]:
# recent pandas releases warn about that chained in-place form, and it may not modify `data`.
columns_with_missing = ['house.theft', 'vehicle.theft', 'out.of.vehicle.theft', 'shop.theft', 'narcotics']
for column in columns_with_missing:
    data[column] = data[column].fillna(data[column].median())
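
To show why the median is often preferred over the mean when outliers are present, the sketch below imputes the same gap both ways. The series is invented for illustration: one value (500) is far larger than the rest.

```python
import numpy as np
import pandas as pd

# Invented series with one missing value and one extreme value.
s = pd.Series([10.0, 12.0, np.nan, 11.0, 500.0])

# median() and mean() skip NaN by default.
filled_median = s.fillna(s.median())
filled_mean = s.fillna(s.mean())

# The median fill stays close to the typical values; the mean fill is dragged up by 500.
print("median fill:", filled_median[2])
print("mean fill:  ", filled_mean[2])
```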


6. Verifying Missing Values

Explanation:
  data.isnull().sum(): Rechecks the DataFrame for any remaining missing values after imputation.

Purpose:
Ensures that all missing values have been appropriately handled.

In [None]:
# Verify if there are any remaining missing values
print("\nMissing values after handling them:")
print(data.isnull().sum())
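
Reading the printed counts works for a small table. As an optional alternative, the check can be made programmatic so that a leftover gap stops the notebook. The sketch below uses an invented two-column DataFrame, and the assertion message is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Invented data with one gap in column "a".
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# Fill every column's gaps with that column's median.
toy = toy.fillna(toy.median())

# Total number of missing cells across the whole DataFrame.
remaining = int(toy.isnull().sum().sum())
assert remaining == 0, f"{remaining} missing values remain after imputation"
print("No missing values remain.")
```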

7. Normalizing the Data

Explanation:
  StandardScaler: Normalizes the data to have a mean of 0 and a standard deviation of 1, which is important for many machine learning algorithms.
  scaler.fit_transform(data.drop('Year', axis=1)): Applies normalization to all columns except 'Year'.
  pd.DataFrame(): Converts the normalized numpy array back to a DataFrame, after which the 'Year' column is reattached.

Purpose:
Normalization ensures that all features contribute equally to the analysis and model training, preventing any single feature from dominating due to its scale.

In [None]:
# Normalizing the data (excluding the 'Year' column)
scaler = StandardScaler()
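# Take the column names and index from the data itself rather than assuming 'Year' is the first column
data_scaled = pd.DataFrame(scaler.fit_transform(data.drop('Year', axis=1)), columns=data.columns.drop('Year'), index=data.index)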
data_scaled['Year'] = data['Year']
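
The standardization can be checked by hand on a toy example. The sketch below uses invented values and compares StandardScaler with the formula (x - mean) / std. StandardScaler divides by the population standard deviation, which corresponds to ddof=0 in pandas (whose default is ddof=1).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented values; 'Year' is kept out of the scaling, as in the step above.
toy = pd.DataFrame({"Year": [1950, 1951, 1952, 1953], "robbery": [2.0, 4.0, 6.0, 8.0]})
features = toy.drop("Year", axis=1)

# Library result: a NumPy array with one column per feature.
scaled = StandardScaler().fit_transform(features)

# Manual z-score using the population standard deviation (ddof=0).
manual = (features - features.mean()) / features.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))    # both approaches agree
print(scaled.mean(axis=0), scaled.std(axis=0))   # approximately 0 and 1
```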

8. Displaying Cleaned and Normalized Data

Explanation:
Displays the first few rows of the cleaned and normalized DataFrame.

Purpose:
Provides a final check to ensure that the data cleaning and normalization processes have been correctly applied.

Explanation:
  print("\nCleaned and normalized data:"): Prints a header message to indicate that the following output will display the cleaned and normalized data.
  print(data_scaled.head()): Displays the first five rows of the cleaned and normalized DataFrame data_scaled.

Purpose:
    Verification: Provides a quick visual check to confirm that the data has been cleaned and normalized correctly. It allows you to see a sample of the processed data directly in the notebook.

Explanation:
  data_scaled.to_csv('cleaned_Swedish_data.csv', index=False): Saves the cleaned and normalized DataFrame data_scaled to a CSV file named cleaned_Swedish_data.csv.
  'cleaned_Swedish_data.csv': The name of the file where the data will be saved.
  index=False: This parameter ensures that the DataFrame's index is not saved as a separate column in the CSV file.

Purpose:
    Persistence: Ensures that the cleaned and normalized data is stored permanently on disk, allowing for future use without needing to reprocess the data.
    Sharing: Facilitates easy sharing of the processed data with colleagues or for use in different environments or tools.

Explanation:
  print("Cleaned and normalized data saved to 'cleaned_Swedish_data.csv'."): Prints a confirmation message indicating that the data has been successfully saved to the specified CSV file.

Purpose:
    Feedback: Provides immediate feedback to the user, confirming that the data-saving operation was successful. This is useful for ensuring that the data has been saved correctly without needing to manually check the file system.



In [None]:
# Display the cleaned and normalized data
print("\nCleaned and normalized data:")
print(data_scaled.head())

# Save the cleaned and normalized data to a CSV file
data_scaled.to_csv('cleaned_Swedish_data.csv', index=False)

# Print confirmation
print("Cleaned and normalized data saved to 'cleaned_Swedish_data.csv'.")
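
One way to confirm that the saved file matches what was in memory is to read it back and compare. The sketch below does this with an invented DataFrame and a temporary directory; the file name toy_cleaned.csv is hypothetical and the real output file is not touched. With index=False, the reloaded frame contains only the data columns and no extra index column.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the scaled data, with 'Year' as the last column.
toy = pd.DataFrame({"robbery": [-1.0, 0.0, 1.0], "Year": [1950, 1951, 1952]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")  # hypothetical file name
    toy.to_csv(path, index=False)                     # write without the index column
    reloaded = pd.read_csv(path)                      # read it back before cleanup

print(list(reloaded.columns))
print(reloaded.equals(toy))
```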