Task-2

Description:
This task involves using the Pandas library to manipulate data.

Responsibility:
Load a CSV file into a Pandas DataFrame. Perform operations like filtering data based on conditions, handling missing values, and calculating summary statistics.

File:
Documents/Main Flow Intern/01.Data Cleaning and Preprocessing.csv

Detailed description:

This task uses the Pandas library in Python to clean and manipulate a CSV file containing various observations and measurements. The main steps are loading the CSV file into a DataFrame, counting and handling missing values, calculating summary statistics, and filtering rows on a condition. Cleaning matters because errors and gaps in the data directly reduce the reliability of any analysis built on it. Filtering narrows the data to the relevant subset, handling missing values (by dropping the affected rows or filling them with suitable values) makes the dataset complete, and summary statistics give a quick overview that helps identify trends and outliers. Together these steps prepare the data for further analysis or modeling.

Methods used:

pd.read_csv(file_path): Loads a CSV file into a DataFrame.

DataFrame.head(): Displays the first few rows of the DataFrame.

DataFrame.dropna(): Removes rows with any missing values.

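DataFrame.fillna(value): Fills missing values with a specified value. Forward fill and backward fill instead propagate the previous or the next valid value into each gap.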

DataFrame.describe(): Provides summary statistics for numerical columns.

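Series.mean(): Calculates the mean of a single column, for example df['ChipRate'].mean(). DataFrame.mean(numeric_only=True) returns the mean of every numeric column at once.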

Filtering with conditions: Filters rows based on specified conditions.

DataFrame.isna() and DataFrame.sum(): Detects and counts missing values in each column.
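
The methods listed above can be tried together on a small frame before running them on the real file. The following is a minimal sketch, assuming a hypothetical four-row `sample` DataFrame with made-up values; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real CSV: two numeric columns with gaps.
sample = pd.DataFrame({
    "Observation": ["31-00:00", "31-01:00", "31-02:00", "31-03:00"],
    "ChipRate": [16.5, np.nan, 16.7, 16.4],
    "BlowFlow": [1177.6, 1328.4, np.nan, 1334.9],
})

missing_per_column = sample.isna().sum()            # count of NaN per column
rows_without_gaps = sample.dropna()                 # keep only complete rows
zero_filled = sample.fillna(0)                      # replace every NaN with 0
chip_rate_mean = sample["ChipRate"].mean()          # NaN is skipped by default
high_blow_flow = sample[sample["BlowFlow"] > 1300]  # boolean-mask filter

print(missing_per_column)
print(rows_without_gaps)
print(chip_rate_mean)
print(high_blow_flow)
```

Filling with 0 is used here only to show the call; whether a constant is a sensible replacement depends on what the column measures.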

In [1]:
import pandas as pd


In [2]:
# Load the CSV file into a DataFrame
file_path = '/content/01.Data Cleaning and Preprocessing.csv'
df = pd.read_csv(file_path)

In [3]:
# Display the first few rows of the DataFrame
print("Initial DataFrame:")
print(df.head())


Initial DataFrame:
  Observation  Y-Kappa  ChipRate  BF-CMratio  BlowFlow  ChipLevel4   \
0    31-00:00    23.10    16.520     121.717  1177.607      169.805   
1    31-01:00    27.60    16.810      79.022  1328.360      341.327   
2    31-02:00    23.19    16.709      79.562  1329.407      239.161   
3    31-03:00    23.60    16.478      81.011  1334.877      213.527   
4    31-04:00    22.90    15.618      93.244  1334.168      243.131   

   T-upperExt-2   T-lowerExt-2    UCZAA  WhiteFlow-4   ...  SteamFlow-4   \
0        358.282         329.545  1.443       599.253  ...        67.122   
1        351.050         329.067  1.549       537.201  ...        60.012   
2        350.022         329.260  1.600       549.611  ...        61.304   
3        350.938         331.142  1.604       623.362  ...        68.496   
4        351.640         332.709    NaN       638.672  ...        70.022   

   Lower-HeatT-3  Upper-HeatT-3   ChipMass-4   WeakLiquorF   BlackFlow-2   \
0        329.432    

In [4]:
# Detect and count missing values in each column
missing_values = df.isna().sum()
print("\nMissing values in each column:")
print(missing_values)


Missing values in each column:
Observation          0
Y-Kappa              0
ChipRate             5
BF-CMratio          17
BlowFlow            16
ChipLevel4           1
T-upperExt-2         2
T-lowerExt-2         2
UCZAA               25
WhiteFlow-4          1
AAWhiteSt-4        151
AA-Wood-4            1
ChipMoisture-4       1
SteamFlow-4          1
Lower-HeatT-3        2
Upper-HeatT-3        2
ChipMass-4           1
WeakLiquorF          1
BlackFlow-2          2
WeakWashF            1
SteamHeatF-3         2
T-Top-Chips-4        1
SulphidityL-4      151
dtype: int64
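
Raw counts are easier to judge as a share of the row count: AAWhiteSt-4 and SulphidityL-4 each miss 151 of 324 values, about 47%. The sketch below ranks columns by percentage missing. It runs on a hypothetical `demo` frame with made-up values, and the 40% cut-off is an assumption chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; column names borrowed from the dataset.
demo = pd.DataFrame({
    "Y-Kappa": [23.1, 27.6, 23.19, 23.6],
    "UCZAA": [1.443, 1.549, np.nan, 1.604],
    "AAWhiteSt-4": [np.nan, 6.1, np.nan, np.nan],
})

# isna() gives booleans; their mean is the fraction missing per column.
missing_share = demo.isna().mean().mul(100).sort_values(ascending=False)
print(missing_share.round(1))

# Columns whose missing share exceeds the assumed 40% cut-off.
mostly_empty = missing_share[missing_share > 40].index.tolist()
print(mostly_empty)
```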


In [5]:
# Remove rows with any missing values
cleaned_df = df.dropna()
print("\nDataFrame after removing rows with missing values:")
print(cleaned_df)


DataFrame after removing rows with missing values:
    Observation  Y-Kappa  ChipRate  BF-CMratio  BlowFlow  ChipLevel4   \
1      31-01:00    27.60    16.810      79.022  1328.360      341.327   
3      31-03:00    23.60    16.478      81.011  1334.877      213.527   
5       1-08:00    14.23    15.350      85.518  1171.604      198.538   
7      31-06:00    22.65    14.100      91.887  1307.852      288.989   
9      31-08:00    24.70    13.850      96.208  1334.892      362.511   
..          ...      ...       ...         ...       ...          ...   
312    31-10:00    24.40    14.117      85.998  1330.104      394.234   
317     4-16:00    17.80    16.625      78.367  1276.082      202.744   
319    10-16:00    23.75    12.667      93.450  1178.252      276.955   
320     9-19:00    19.80    12.558      94.352  1184.119      297.071   
322     9-21:00    24.32    13.083      88.910  1192.879      318.006   

     T-upperExt-2   T-lowerExt-2    UCZAA  WhiteFlow-4   ...  SteamFlow
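
Because `dropna()` discards a row when any column is missing, a few sparse columns can remove most of the data. A possible alternative, sketched on a hypothetical `demo` frame, is to drop sparse columns first or to require values only in the columns that matter. The 70% completeness requirement is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column is mostly empty.
demo = pd.DataFrame({
    "ChipRate": [16.5, 16.8, np.nan, 16.4, 15.6],
    "BlowFlow": [1177.6, 1328.4, 1329.4, 1334.9, 1334.2],
    "SulphidityL-4": [np.nan, 30.1, np.nan, np.nan, 29.8],
})

# Strict: any missing value removes the row.
strict = demo.dropna()

# Relaxed: keep columns that are at least 70% complete, then drop rows.
min_non_missing = int(np.ceil(0.7 * len(demo)))
dense_columns = demo.dropna(axis="columns", thresh=min_non_missing)
relaxed = dense_columns.dropna()

# Targeted: only require a value in the column being analysed.
key_only = demo.dropna(subset=["ChipRate"])

print(len(strict), len(relaxed), len(key_only))
```

Which variant is appropriate depends on whether the sparse columns are needed for the later analysis.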

In [7]:
# Fill missing values using forward fill
# (fillna(method='ffill') is deprecated in recent pandas releases)
filled_ffill_df = df.ffill()
print("\nDataFrame after filling missing values with forward fill:")
print(filled_ffill_df)




DataFrame after filling missing values with forward fill:
    Observation  Y-Kappa  ChipRate  BF-CMratio  BlowFlow  ChipLevel4   \
0      31-00:00    23.10    16.520     121.717  1177.607      169.805   
1      31-01:00    27.60    16.810      79.022  1328.360      341.327   
2      31-02:00    23.19    16.709      79.562  1329.407      239.161   
3      31-03:00    23.60    16.478      81.011  1334.877      213.527   
4      31-04:00    22.90    15.618      93.244  1334.168      243.131   
..          ...      ...       ...         ...       ...          ...   
319    10-16:00    23.75    12.667      93.450  1178.252      276.955   
320     9-19:00    19.80    12.558      94.352  1184.119      297.071   
321     9-20:00    23.01    12.550      90.842  1188.517      289.826   
322     9-21:00    24.32    13.083      88.910  1192.879      318.006   
323     9-22:00    25.75    13.417      85.451  1186.342      248.312   

     T-upperExt-2   T-lowerExt-2    UCZAA  WhiteFlow-4   ...  St

In [8]:
# Fill missing values using backward fill
# (fillna(method='bfill') is deprecated in recent pandas releases)
filled_bfill_df = df.bfill()
print("\nDataFrame after filling missing values with backward fill:")
print(filled_bfill_df)


DataFrame after filling missing values with backward fill:
    Observation  Y-Kappa  ChipRate  BF-CMratio  BlowFlow  ChipLevel4   \
0      31-00:00    23.10    16.520     121.717  1177.607      169.805   
1      31-01:00    27.60    16.810      79.022  1328.360      341.327   
2      31-02:00    23.19    16.709      79.562  1329.407      239.161   
3      31-03:00    23.60    16.478      81.011  1334.877      213.527   
4      31-04:00    22.90    15.618      93.244  1334.168      243.131   
..          ...      ...       ...         ...       ...          ...   
319    10-16:00    23.75    12.667      93.450  1178.252      276.955   
320     9-19:00    19.80    12.558      94.352  1184.119      297.071   
321     9-20:00    23.01    12.550      90.842  1188.517      289.826   
322     9-21:00    24.32    13.083      88.910  1192.879      318.006   
323     9-22:00    25.75    13.417      85.451  1186.342      248.312   

     T-upperExt-2   T-lowerExt-2    UCZAA  WhiteFlow-4   ...  S
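
Forward and backward fill differ at the edges: a forward fill cannot fill a gap in the first row, and a backward fill cannot fill one in the last row. The sketch below shows this on a hypothetical `readings` series with made-up values, along with two common alternatives (chaining both fills, or filling with the column mean).

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps at the start, middle, and end.
readings = pd.Series([np.nan, 1.4, np.nan, 1.6, np.nan], name="UCZAA")

forward = readings.ffill()    # leading NaN stays: nothing precedes it
backward = readings.bfill()   # trailing NaN stays: nothing follows it
both = readings.ffill().bfill()                 # fills every gap
mean_filled = readings.fillna(readings.mean())  # constant fill with the mean

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
print(mean_filled.tolist())
```

Forward and backward fill assume that neighbouring rows are related, for example consecutive hourly observations, so row order matters when using them.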

In [9]:
# Provide summary statistics for numerical columns
summary_stats = df.describe()
print("\nSummary statistics:")
print(summary_stats)


Summary statistics:
          Y-Kappa    ChipRate  BF-CMratio     BlowFlow  ChipLevel4   \
count  324.000000  319.000000  307.000000   308.000000   323.000000   
mean    20.635370   14.347937   87.464456  1237.837614   258.164483   
std      3.070036    1.499095    7.995012   100.593735    87.987452   
min     12.170000    9.983000   68.645000     0.000000     0.000000   
25%     18.382500   13.358000   81.823000  1193.215250   213.527000   
50%     20.845000   14.308000   86.739000  1273.138500   271.792000   
75%     23.032500   15.517000   92.372000  1289.196000   321.680000   
max     27.600000   16.958000  121.717000  1351.240000   419.014000   

       T-upperExt-2   T-lowerExt-2         UCZAA  WhiteFlow-4   AAWhiteSt-4   \
count     322.000000      322.000000  299.000000    323.000000    173.000000   
mean      356.904295      324.020180    1.492010    591.732260      6.140410   
std         9.209290        7.621402    0.105923     67.016351      0.081609   
min       339.16800
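
The summary shows a minimum of 0.0 for BlowFlow, far below its first quartile of about 1193, which is the kind of value these statistics help surface. One way to flag such values is the 1.5 * IQR rule, sketched here on a hypothetical `blow_flow` series. The rule is an assumption for illustration and is not part of the original task.

```python
import pandas as pd

# Hypothetical values loosely modelled on the BlowFlow summary above.
blow_flow = pd.Series(
    [0.0, 1177.6, 1193.2, 1273.1, 1289.2, 1329.4, 1351.2], name="BlowFlow"
)

q1 = blow_flow.quantile(0.25)
q3 = blow_flow.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged for review.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = blow_flow[(blow_flow < lower) | (blow_flow > upper)]
print(outliers)
```

A flagged value is a candidate for inspection, not automatically an error.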

In [10]:
# Calculate the mean of specific columns (example for 'ChipRate' and 'BlowFlow')
chip_rate_mean = df['ChipRate'].mean()
blow_flow_mean = df['BlowFlow'].mean()
print(f'\nMean ChipRate: {chip_rate_mean}')
print(f'Mean BlowFlow: {blow_flow_mean}')


Mean ChipRate: 14.347937304075236
Mean BlowFlow: 1237.8376136363638


In [11]:
# Filter rows where 'ChipRate' is greater than 50
# (the maximum ChipRate in the summary statistics is 16.958, so no rows match)
filtered_df = df[df['ChipRate'] > 50]
print("\nFiltered DataFrame (ChipRate > 50):")
print(filtered_df)


Filtered DataFrame (ChipRate > 50):
Empty DataFrame
Columns: [Observation, Y-Kappa, ChipRate, BF-CMratio, BlowFlow, ChipLevel4 , T-upperExt-2 , T-lowerExt-2  , UCZAA, WhiteFlow-4 , AAWhiteSt-4 , AA-Wood-4  , ChipMoisture-4 , SteamFlow-4 , Lower-HeatT-3, Upper-HeatT-3 , ChipMass-4 , WeakLiquorF , BlackFlow-2 , WeakWashF , SteamHeatF-3 , T-Top-Chips-4 , SulphidityL-4 ]
Index: []

[0 rows x 23 columns]
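
The empty result follows from the threshold: ChipRate never exceeds 16.958 in this dataset. The column list also shows that several header names carry trailing spaces (for example `'ChipLevel4 '`), which makes them awkward to reference. The sketch below, on a hypothetical `demo` frame, strips the header whitespace, derives a threshold from the data instead of hard-coding one, and combines two conditions. Using the median as the threshold and 200 as the second cut-off are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the second header keeps its trailing space on purpose.
demo = pd.DataFrame({
    "ChipRate": [16.5, 16.8, np.nan, 14.1, 12.6],
    "ChipLevel4 ": [169.8, 341.3, 239.2, 289.0, 297.1],
})

# Remove leading/trailing whitespace from every column name.
demo.columns = demo.columns.str.strip()

# Data-driven threshold: the median of the non-missing ChipRate values.
threshold = demo["ChipRate"].median()
above_median = demo[demo["ChipRate"] > threshold]

# Combine conditions with & and wrap each one in parentheses.
combined = demo[(demo["ChipRate"] > threshold) & (demo["ChipLevel4"] > 200)]

print(above_median)
print(combined)
```

Rows where ChipRate is missing never satisfy the comparison, so they are excluded from both results.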
