🌌 ZTF Dataset Column Definitions
🔭 1. ra

Right Ascension (in degrees) of the detection/object.

Equivalent to longitude on the sky.

Range: 0° to 360°.

🔭 2. dec

Declination (in degrees) of the detection/object.

Equivalent to latitude on the sky.

Range: –90° to +90°.

🧠 3. infobits

A bitwise flag carrying processing information about the exposure.

Each bit represents a specific condition (for example, a read-noise issue or saturated pixels).

Typical values: 0 (no issues), a single power of 2 such as 67108864 (2^26), or a sum of powers of 2 such as 67108912 when several bits are set.
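
Because infobits is a bitmask, a value can be unpacked into the positions of the bits that are set. The following is a minimal sketch in plain Python; the helper name `set_bits` is illustrative, and the meaning of each bit position is not covered here and has to be looked up in the ZTF documentation.

```python
def set_bits(value: int) -> list[int]:
    """Return the zero-based positions of the bits that are set in value."""
    return [pos for pos in range(value.bit_length()) if (value >> pos) & 1]


# The three infobits values that appear in the data preview
for infobits in (0, 67108864, 67108912):
    print(infobits, "->", set_bits(infobits))
```

A value of 0 yields an empty list, and a value with several flags yields one position per flag.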

🗺 4. field

ZTF Field ID

ZTF tiles the sky with a fixed grid of fields, each roughly the size of the camera's field of view.

This tells you which sky field the exposure belongs to.

Example:
field = 570 → the exposure belongs to ZTF Field 570.

📷 5. ccdid

The ZTF camera has 16 CCDs, numbered 1 to 16.

ccdid identifies which CCD captured this image.

📷 6. qid

Each CCD is read out in four quadrants (sub-detectors):

QID = 1, 2, 3, or 4

Each value corresponds to one readout quadrant.

📷 7. rcid

Readout Channel ID (0–63)

The ZTF camera has 64 readout channels, each corresponding to a unique CCD quadrant.

rcid = (CCDid - 1) × 4 + (QID - 1)

Example:
CCDid = 16, QID = 1 → rcid = (16 - 1) × 4 + (1 - 1) = 60.

Matches your data:
ccdid = 16, qid = 1 → rcid = 60
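
The rcid formula and its inverse can be written as two small functions. This is a sketch only; the function names are made up for illustration, and the range checks simply follow the ranges stated above (16 CCDs, 4 quadrants, 64 channels).

```python
def rcid_from(ccdid: int, qid: int) -> int:
    """Readout channel ID from CCD number (1-16) and quadrant number (1-4)."""
    if not (1 <= ccdid <= 16 and 1 <= qid <= 4):
        raise ValueError("ccdid must be 1-16 and qid must be 1-4")
    return (ccdid - 1) * 4 + (qid - 1)


def ccd_quadrant_from(rcid: int) -> tuple[int, int]:
    """Inverse mapping: (ccdid, qid) from a readout channel ID (0-63)."""
    if not 0 <= rcid <= 63:
        raise ValueError("rcid must be 0-63")
    return rcid // 4 + 1, rcid % 4 + 1


# (ccdid, qid, rcid) triples taken from the data preview
for ccdid, qid, rcid in [(16, 1, 60), (16, 3, 62), (12, 3, 46)]:
    print(ccdid, qid, rcid, rcid_from(ccdid, qid) == rcid)
```

The same check can be applied to a whole table to confirm that the three columns are mutually consistent.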

🎨 8. fid

Filter ID:

fid	Filter
1	zg (green)
2	zr (red)
3	zi (infrared)

In the data:
fid = 2 → zr-band (red).

🎨 9. filtercode

The name of the filter used:

“zg”

“zr”

“zi”

Each value corresponds to one fid (1 = zg, 2 = zr, 3 = zi).
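
Since fid and filtercode carry the same information, their agreement can be checked with a lookup table. A minimal sketch, assuming the mapping listed for fid above; the names `FID_TO_FILTERCODE` and `filter_matches` are illustrative.

```python
# Mapping between the numeric filter ID and the filter name
FID_TO_FILTERCODE = {1: "zg", 2: "zr", 3: "zi"}


def filter_matches(fid: int, filtercode: str) -> bool:
    """True when the filter name agrees with the numeric filter ID."""
    return FID_TO_FILTERCODE.get(fid) == filtercode


# (fid, filtercode) pairs as they appear in the data preview
for fid, filtercode in [(2, "zr"), (1, "zg")]:
    print(fid, filtercode, filter_matches(fid, filtercode))
```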

🆔 10. pid

Processing ID (unique integer)

A unique ID associated with the image/exposure

Higher PID = more recent observation.

🗺 11–18. ra1, dec1, … ra4, dec4

These are the four corners of the CCD image footprint.

ZTF stores the sky coordinates of the 4 corners of each CCD/quadrant image:

Column	Meaning
ra1, dec1	Corner 1 of the image
ra2, dec2	Corner 2
ra3, dec3	Corner 3
ra4, dec4	Corner 4

These are used for:

Mapping CCD footprint on the sky

WCS calculations

Checking if a target falls inside the image

Astrometric corrections

Example from your row:

ra1 = 142.9726°, dec1 = 22.6797°

ra2 = 142.0367°, dec2 = 22.6647°

… and so on

Together, the four corners form a quadrilateral outlining the image.
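
One of the uses listed above, checking whether a target falls inside the image, can be sketched with a standard ray-casting test. This version treats RA and Dec as flat x/y coordinates, so it assumes a small footprint far from the celestial poles that does not straddle RA = 0°/360°, and it assumes the corners are given in order around the quadrilateral. The corner values below are rounded, made-up numbers for illustration, not a row from the dataset.

```python
def point_in_polygon(ra, dec, corners):
    """Ray-casting test; corners is a list of (ra, dec) vertices in order."""
    inside = False
    n = len(corners)
    for i in range(n):
        x1, y1 = corners[i]
        x2, y2 = corners[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count
        if (y1 > dec) != (y2 > dec):
            # RA at which this edge crosses that horizontal line
            x_cross = x1 + (dec - y1) * (x2 - x1) / (y2 - y1)
            if ra < x_cross:
                inside = not inside
    return inside


# Hypothetical footprint: (ra1, dec1) ... (ra4, dec4)
footprint = [(143.0, 22.7), (142.0, 22.7), (142.0, 21.8), (143.0, 21.8)]
print(point_in_polygon(142.5, 22.2, footprint))
print(point_in_polygon(141.5, 22.2, footprint))
```

For footprints near the poles or across the RA wrap, a spherical-geometry library would be the safer choice.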

📅 19. ipac_pub_date

The date the exposure was published to IRSA (not observation date).

Example:
2020-12-09 00:00:00+00

🔢 20. ipac_gid

Group ID used internally by IPAC.

Usual values: 1–10

Meaning:

Grouping of exposures related to the same night or processing batch.

<details>
<summary>üìå Cell Description: Loading and Inspecting the Raw ZTF Dataset</summary>

This cell loads the previously downloaded ZTF search results into a DataFrame for inspection and further processing. It reads the CSV file generated by the ZTF image search, displays the first few rows, and prints the number of rows and columns. This quick check confirms that the data loaded correctly and contains the expected fields before cleaning, selection, and analysis begin. The preview also gives a first look at the available metadata, such as sky coordinates, filters, and detector IDs.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Reads the ZTF dataset** from a CSV file created during the image search step.  
- **Loads the data into a Pandas DataFrame**, which is the standard tool for working with tabular datasets in data science.  
- **Displays the first few rows** (`df.head()`), allowing the researcher to visually confirm what type of information the dataset contains.  
- **Prints the dataset size** (number of rows and columns), helping to understand how large the dataset is before preprocessing.  
- **Ensures the loaded data is correct** before moving on to cleaning, filtering, or extracting useful metadata.  
- **Serves as the “starting point”** for all subsequent data science operations such as feature engineering, anomaly detection, visualizations, and model training.  
- **Helps verify the file path and content**, avoiding errors that could appear later if the dataset was not loaded properly.

---

### ⭐ **Why This Cell Matters for the Research**
This step confirms that the astronomical metadata needed for the research is successfully captured in a structured format. Since the entire pipeline (feature extraction, visualization, and modelling) depends on this dataset, it is essential to verify the data early. By previewing and summarizing the dataset, the researcher ensures that the upcoming steps in the workflow are operating on valid ZTF data.

</details>


In [5]:
import pandas as pd

# Load into a DataFrame
df = pd.read_csv('ztf_image_search_results_full.csv')
print(df.head())

# Summary of the raw dataset
print('Rows, Columns:', df.shape)

           ra        dec  infobits  field  ccdid  qid  rcid  fid filtercode  \
0  142.513076  22.239831  67108864    570     16    1    60    2         zr   
1  141.637575  21.360661         0    570     16    3    62    2         zr   
2  141.618681  22.225694         0    570     16    2    61    2         zr   
3  141.596498  22.204009  67108912    570     16    2    61    1         zg   
4  141.637182  19.443859         0    570     12    3    46    2         zr   

             pid  ...         ra1       dec1         ra2       dec2  \
0   769412526015  ...  142.972615  22.679692  142.036715  22.664666   
1  1503405246215  ...  142.091906  21.803041  141.161603  21.782992   
2  1504338726115  ...  142.075514  22.668183  141.139773  22.647755   
3  1066544346115  ...  142.053404  22.646507  141.117659  22.626244   
4  1521276364615  ...  142.085851  19.886330  141.167231  19.866253   

          ra3       dec3         ra4       dec4           ipac_pub_date  \
0  142.055997  21.79858

<details>
<summary>üìå Cell Description: Checking Missing Values, Normalizing Data, and Removing Duplicates</summary>

This cell performs the essential early-stage data cleaning steps required to ensure the ZTF metadata is trustworthy and ready for analysis. It begins by identifying the data types of the columns and counting how many missing values exist in each one. Since astronomical datasets often contain different placeholder formats for missing information (such as “NA”, “None”, or empty strings), the cell standardizes all these placeholders into proper `NaN` values. This normalization helps Pandas handle missing data consistently throughout the research.

After cleaning missing-value formats, the cell removes any duplicate rows from the dataset. Duplicates can occur when multiple searches return overlapping results or when different query filters produce repeated entries. Removing duplicates ensures that later analyses (statistical summaries, visualizations, or machine-learning tasks) are not biased by repeated observations. The cell also prints how many duplicates were removed, giving the researcher a clear view of the dataset’s reliability before proceeding to deeper analysis.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Counts missing values** in each column to understand the overall completeness of the dataset.  
- **Detects different placeholder formats** used by the original data source (e.g., “NA”, “None”, empty spaces).  
- **Standardizes all placeholder values to `NaN`**, allowing Pandas to treat them correctly during feature engineering and modelling.  
- **Improves data consistency**, which is crucial because even a few wrongly formatted entries can break later calculations.  
- **Removes duplicate rows**, ensuring each astronomical observation appears only once.  
- **Prevents model bias**, since duplicated records could artificially inflate certain classes or features.  
- **Prints the number of removed duplicates**, confirming that the dataset has been cleaned successfully.  
- **Forms a clean and reliable foundation** for every upcoming step, including visualization, feature extraction, and ML modelling.  

---

### ⭐ **Why This Cell Is Important for the Research**
Astronomical datasets like ZTF metadata often contain missing entries, inconsistent formatting, and duplicate records. If these issues are not corrected early, they can distort statistical patterns, weaken machine-learning performance, and mislead scientific conclusions. This cell establishes a clean, standardized dataset that ensures any insights, visualizations, or models created later in the research are accurate and trustworthy. By normalizing missing values and removing duplicates, the analysis becomes more robust and scientifically reliable.

</details>


In [13]:
print('\nColumn types:')
print(df.dtypes.value_counts())


Column types:
float64    25
int64      14
object      5
Name: count, dtype: int64


In [12]:
print('\nMissing values per column:')
print(df.isnull().sum())


Missing values per column:
ra               0
dec              0
infobits         0
field            0
ccdid            0
qid              0
rcid             0
fid              0
filtercode       0
pid              0
nid              0
expid            0
itid             0
imgtype          0
imgtypecode      0
obsdate          0
obsjd            0
exptime          0
filefracday      0
seeing           0
airmass          0
moonillf         0
moonesb          0
maglimit         0
crpix1           0
crpix2           0
crval1           0
crval2           0
cd11             0
cd12             0
cd21             0
cd22             0
ra1              0
dec1             0
ra2              0
dec2             0
ra3              0
dec3             0
ra4              0
dec4             0
ipac_pub_date    0
ipac_gid         0
log1p_exptime    0
log1p_airmass    0
dtype: int64


In [6]:
# Replace common text placeholders with NaN for consistent handling
import numpy as np
df.replace(['', ' ', 'NA', 'NaN', 'nan', 'None', 'none', 'NULL'], np.nan, inplace=True)
print('\nAfter normalization of placeholders, missing per column:')
print(df.isnull().sum())


After normalization of placeholders, missing per column:
ra               0
dec              0
infobits         0
field            0
ccdid            0
qid              0
rcid             0
fid              0
filtercode       0
pid              0
nid              0
expid            0
itid             0
imgtype          0
imgtypecode      0
obsdate          0
obsjd            0
exptime          0
filefracday      0
seeing           0
airmass          0
moonillf         0
moonesb          0
maglimit         0
crpix1           0
crpix2           0
crval1           0
crval2           0
cd11             0
cd12             0
cd21             0
cd22             0
ra1              0
dec1             0
ra2              0
dec2             0
ra3              0
dec3             0
ra4              0
dec4             0
ipac_pub_date    0
ipac_gid         0
dtype: int64


In [7]:
# 2) Remove duplicate rows
before = df.shape[0]
df = df.drop_duplicates().reset_index(drop=True)
after = df.shape[0]
print(f'Removed {before - after} duplicate rows')

Removed 0 duplicate rows
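
As an alternative to replacing placeholders after loading, the same strings can be mapped to `NaN` while the file is read, using the `na_values` parameter of `pandas.read_csv`. The sketch below uses a small inline CSV with made-up rows instead of the real file, so the names `csv_text` and `df_demo` are illustrative only.

```python
import io

import pandas as pd

# Made-up sample standing in for the real CSV file
csv_text = (
    "ra,dec,filtercode\n"
    "142.5,22.2,zr\n"
    "NULL,21.3,none\n"
    "141.6,22.2,zg\n"
)

# Same placeholder list as in the cell above
placeholders = ["", " ", "NA", "NaN", "nan", "None", "none", "NULL"]

# na_values adds these strings to the values pandas already treats as missing
df_demo = pd.read_csv(io.StringIO(csv_text), na_values=placeholders)
print(df_demo.isnull().sum())
```

Handling placeholders at load time also lets pandas infer numeric dtypes directly, since placeholder strings no longer force a column to `object`.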


<details>
<summary>üìå Cell Description: Handling Missing Data Using a Structured Cleaning Strategy</summary>

This cell applies a systematic method to clean missing values from the ZTF dataset. Real astronomical datasets often contain gaps: some observations may be incomplete, corrupted, or missing certain metadata fields. Instead of removing all rows with missing values (which would waste valuable data), this cell uses a balanced strategy to preserve as much information as possible while still ensuring dataset quality.

The process begins by removing only those rows that are mostly empty (more than 50% missing values). This avoids keeping rows that contain almost no useful information. After that, the cell separates numerical and categorical columns because each type requires a different method for filling missing values. Numerical columns (such as flux values, coordinates, or exposure metadata) are filled using the **median**, which is less sensitive to outliers than the mean. Categorical columns (such as file names or filter types) are filled using the **mode**, which replaces missing values with the most common category.

This cleaning strategy keeps the dataset as complete as possible while limiting the bias that imputation can introduce. The final print statement confirms how many missing values remain in each column after the cleaning process, ensuring transparency and reproducibility.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Removes only highly incomplete rows** (rows missing more than 50% of their values), preventing unnecessary data loss.  
- **Separates numerical and categorical columns**, because each type requires different imputation methods.  
- **Fills missing numeric values with the median**, which is robust and prevents distortion from extreme outliers.  
- **Replaces missing categorical values with the mode**, ensuring the filled value matches the most common or expected category.  
- **Uses a fallback option (‘Unknown’)** if no mode exists, making the dataset consistent.  
- **Ensures the dataset has no remaining missing values** that could break future analysis steps.  
- **Keeps the data scientifically meaningful**, since the strategy preserves important observations instead of discarding them.  
- **Prepares the dataset for machine learning**, where models require complete and clean data to perform accurately.  

---

### ⭐ **Why This Cell Is Important for the Research**
Astronomical data frequently comes with missing fields due to observation limits, sensor errors, or data transmission issues. Poor handling of missing values can lead to biased models, incorrect scientific insights, or unstable algorithms. This cell applies a well-established and research-friendly cleaning strategy that maintains the integrity and usefulness of the ZTF dataset. By filling gaps intelligently and removing only severely incomplete records, the dataset becomes reliable, consistent, and ready for deeper statistical analysis, visualization, and machine-learning workflows.

</details>


In [8]:
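# 3) Handle nulls: drop rows with >50% missing, impute numerics with median and categoricals with mode
import math
# dropna(thresh=k) keeps rows with at least k non-null values;
# rounding up makes the 50% cut exact when the column count is odd
threshold = math.ceil(df.shape[1] * 0.5)
df = df.dropna(thresh=threshold).reset_index(drop=True)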
num_cols = df.select_dtypes(include=['number']).columns.tolist()
cat_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
for c in num_cols:
    if df[c].isnull().any():
        df[c] = df[c].fillna(df[c].median())
for c in cat_cols:
    if df[c].isnull().any():
        mode = df[c].mode()
        fill_val = mode.iloc[0] if not mode.empty else 'Unknown'
        df[c] = df[c].fillna(fill_val)
print('After null handling, remaining missing per column:')
print(df.isnull().sum())

After null handling, remaining missing per column:
ra               0
dec              0
infobits         0
field            0
ccdid            0
qid              0
rcid             0
fid              0
filtercode       0
pid              0
nid              0
expid            0
itid             0
imgtype          0
imgtypecode      0
obsdate          0
obsjd            0
exptime          0
filefracday      0
seeing           0
airmass          0
moonillf         0
moonesb          0
maglimit         0
crpix1           0
crpix2           0
crval1           0
crval2           0
cd11             0
cd12             0
cd21             0
cd22             0
ra1              0
dec1             0
ra2              0
dec2             0
ra3              0
dec3             0
ra4              0
dec4             0
ipac_pub_date    0
ipac_gid         0
dtype: int64
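
`dropna(thresh=k)` keeps rows that have at least `k` non-null values, so the way `k` is rounded decides which rows survive when the number of columns is odd. The toy example below (made-up data, illustrative names) compares rounding down with rounding up for a three-column frame.

```python
import math

import numpy as np
import pandas as pd

# Row 0: nothing missing, row 1: 2 of 3 missing, row 2: everything missing
df_demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan],
    "b": [2.0, 2.0, np.nan],
    "c": [3.0, np.nan, np.nan],
})

half = df_demo.shape[1] * 0.5  # 1.5 for three columns

# Rounding down requires only 1 non-null value, so row 1 (67% missing) is kept
kept_floor = df_demo.dropna(thresh=int(half))

# Rounding up requires 2 non-null values, so row 1 is dropped
kept_ceil = df_demo.dropna(thresh=math.ceil(half))

print("floor keeps rows:", kept_floor.index.tolist())
print("ceil keeps rows:", kept_ceil.index.tolist())
```

With an even number of columns the two thresholds coincide, so the difference only shows up for odd column counts.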


<details>
<summary>üìå Cell Description: Outlier Removal Using the IQR Method</summary>

This cell removes numerical outliers from the dataset using a well-known statistical technique called the Interquartile Range (IQR) method. Outliers are unusually large or small values that do not fit the normal pattern of the data. In astronomy datasets such as ZTF metadata, outliers can occur due to sensor noise, faulty readings, extreme environmental conditions, or rare technical errors during observations. If left uncorrected, these extreme values can distort graphs, shift averages, mislead machine-learning models, and produce unstable scientific conclusions.

The cell identifies all numerical columns and applies the IQR rule to each one. It calculates the first quartile (Q1), the third quartile (Q3), and the IQR (Q3 − Q1). Any values falling outside the acceptable range (Q1 − 1.5×IQR to Q3 + 1.5×IQR) are considered outliers. The code removes these rows in a cumulative manner, meaning that once outliers from one column are removed, the next column is checked on the remaining dataset. Note that the loop covers every numeric column, including identifier columns such as `ccdid` and `pid`, so rows can be dropped because of an unusual ID rather than an unusual measurement. A final print statement shows how many total rows were removed.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Targets numerical columns only**, since outliers mainly occur in measurements rather than text fields.  
- **Uses the IQR rule**, a widely accepted and robust method for detecting extreme values.  
- **Calculates Q1, Q3, and IQR** to understand the typical spread of each numerical feature.  
- **Marks values more than 1.5×IQR below Q1 or above Q3 as outliers**, which is a standard threshold used in statistics.  
- **Removes outliers iteratively**, ensuring each column is cleaned based on the updated dataset.  
- **Protects the scientific validity** of the dataset by removing physically unrealistic or faulty measurements.  
- **Prevents machine-learning models from being influenced by incorrect values**, improving prediction stability and accuracy.  
- **Displays how many rows were removed**, providing transparency in the data-cleaning process.  

---

### ⭐ **Why This Cell Is Important for the Research**
Astronomical datasets often contain unexpected spikes or errors due to equipment limitations, noise, atmospheric disturbances, or data corruption. These outliers can strongly influence trends, distort visualizations, and misguide the learning behaviour of machine-learning models. By removing unrealistic values using a mathematically sound method, the dataset becomes smoother, cleaner, and far more reliable. This step ensures that all future analysis (visualization, feature extraction, or modelling) is based on high-quality and trustworthy astronomical data.

</details>


In [9]:
# 4) Remove outliers using IQR on numeric columns (iterative cumulative removal)
import pandas as pd
num_cols = df.select_dtypes(include=['number']).columns.tolist()
print('Numeric columns to check for outliers:', num_cols)
initial_rows = df.shape[0]
for col in num_cols:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    if pd.isnull(IQR) or IQR == 0:
        continue
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    df = df[(df[col] >= lower) & (df[col] <= upper)].reset_index(drop=True)
print(f'Removed {initial_rows - df.shape[0]} rows as outliers (cumulative)')

Numeric columns to check for outliers: ['ra', 'dec', 'infobits', 'field', 'ccdid', 'qid', 'rcid', 'fid', 'pid', 'nid', 'expid', 'itid', 'obsjd', 'exptime', 'filefracday', 'seeing', 'airmass', 'moonillf', 'moonesb', 'maglimit', 'crpix1', 'crpix2', 'crval1', 'crval2', 'cd11', 'cd12', 'cd21', 'cd22', 'ra1', 'dec1', 'ra2', 'dec2', 'ra3', 'dec3', 'ra4', 'dec4', 'ipac_gid']
Removed 20587 rows as outliers (cumulative)
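
The loop above checks every numeric column and recomputes the quartiles after each removal. A variant worth comparing, sketched here as an assumption and not what the notebook ran, limits the check to named measurement columns and computes all bounds on the same data before filtering once. The helper `iqr_mask` and the toy frame are illustrative.

```python
import pandas as pd


def iqr_mask(frame, columns, k=1.5):
    """True where every listed column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=frame.index)
    for col in columns:
        q1 = frame[col].quantile(0.25)
        q3 = frame[col].quantile(0.75)
        iqr = q3 - q1
        # Skip constant or empty columns, as the original loop does
        if pd.isnull(iqr) or iqr == 0:
            continue
        mask &= frame[col].between(q1 - k * iqr, q3 + k * iqr)
    return mask


# Made-up data: one extreme seeing value, and an ID column that is not checked
df_demo = pd.DataFrame({
    "seeing": [1.8, 2.0, 2.1, 2.2, 2.4, 9.0],
    "ccdid": [1, 1, 1, 1, 2, 16],
})

clean = df_demo[iqr_mask(df_demo, ["seeing"])].reset_index(drop=True)
print(f"Removed {len(df_demo) - len(clean)} rows as outliers")
```

Which columns count as measurements is a judgment call for the dataset at hand; the point of the sketch is only that the column list is explicit.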


<details>
<summary>üìå Cell Description: Removing Astronomically Impossible or Invalid Values</summary>

This cell performs domain-specific cleaning by removing values that are scientifically impossible in astronomy or clearly incorrect. While previous cleaning steps handled missing values and statistical outliers, this step ensures that all remaining data follows the real physical rules of the sky. Astronomical coordinates such as Right Ascension (RA) and Declination (Dec) have strict valid ranges. Any value outside these limits indicates an error in measurement or metadata. The cell also removes infinite values, NaN entries, and ensures that measurement-related columns, such as flux error, contain only physically meaningful positive values.

By applying these astronomy-based filters, the dataset becomes scientifically trustworthy. This is essential because machine-learning models trained on physically impossible values may produce misleading predictions. The cell also prints how many rows were removed for each check, improving transparency. This final cleaning stage ensures that the dataset is ready for scientific analysis, visualization, and model training without containing physically unrealistic values.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Validates Right Ascension (RA)**  
  - RA must be between **0° and 360°**.  
  - Any value outside this range is physically impossible and therefore removed.

- **Validates Declination (Dec)**  
  - Dec must lie between **–90° and +90°**, matching the limits of the celestial sphere.  
  - Incorrect readings are filtered out.

- **Removes infinite values**  
  - Infinite or undefined numerical values often occur due to division errors or corrupted entries.  
  - These are replaced with NaN and then removed entirely.

- **Drops remaining NaN values**  
  - Ensures the final dataset contains no missing or undefined measurements.

- **Checks flux error values**  
  - If `flux` and `flux_err` exist:  
    - flux error must be **positive**.  
    - Non-positive values usually indicate sensor faults or improperly processed data.

- **Ensures scientific correctness**  
  - Keeps only physically meaningful astronomical measurements.  
  - Removes corrupted or impossible metadata before modelling.

- **Improves reliability of ML models**  
  - Prevents machine-learning algorithms from learning incorrect or physically meaningless patterns.

---

### ⭐ **Why This Cell Is Important for the Research**
Astronomical datasets can contain values that violate the laws of the sky due to sensor glitches, transmission errors, or preprocessing issues. If such values remain in the dataset, they can mislead scientific interpretations or cause machine-learning models to behave unpredictably. By applying physical limits (like RA and Dec ranges) and removing invalid flux measurements, this cell ensures that the dataset respects real astronomical constraints.

This domain-aware cleaning step is what transforms a raw telescope dataset into a **scientifically valid dataset**, ready for trustworthy analysis and modelling.

</details>


In [10]:
# 5) Remove incorrect / impossible values (example astronomical checks)
import numpy as np
if 'ra' in df.columns:
    before = df.shape[0]
    df = df[(df['ra'] >= 0) & (df['ra'] <= 360)].reset_index(drop=True)
    print('Filtered RA outside [0,360], removed', before - df.shape[0])
if 'dec' in df.columns:
    before = df.shape[0]
    df = df[(df['dec'] >= -90) & (df['dec'] <= 90)].reset_index(drop=True)
    print('Filtered Dec outside [-90,90], removed', before - df.shape[0])
# Replace infinities and drop any remaining NaNs
df = df.replace([np.inf, -np.inf], np.nan).dropna().reset_index(drop=True)
print('After removing infinite/NaN values, rows:', df.shape[0])
# Example: require positive flux_err if those columns exist
if set(['flux', 'flux_err']).issubset(df.columns):
    before = df.shape[0]
    df = df[df['flux_err'] > 0].reset_index(drop=True)
    print('Removed rows with non-positive flux_err:', before - df.shape[0])

Filtered RA outside [0,360], removed 0
Filtered Dec outside [-90,90], removed 0
After removing infinite/NaN values, rows: 62368
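
Before rows are dropped, it can help to count how many values violate each rule, so the effect of the filter can be audited. A minimal sketch on made-up data; the frame `df_demo` and the keys of `report` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up rows: one RA above 360, one infinite RA, one Dec below -90
df_demo = pd.DataFrame({
    "ra": [142.5, 361.0, 141.6, np.inf],
    "dec": [22.2, 21.4, -95.0, 19.4],
})

values = df_demo[["ra", "dec"]].to_numpy()

report = {
    # between() is inclusive on both ends by default
    "ra_out_of_range": int((~df_demo["ra"].between(0, 360)).sum()),
    "dec_out_of_range": int((~df_demo["dec"].between(-90, 90)).sum()),
    # Rows containing at least one inf, -inf, or NaN
    "non_finite": int((~np.isfinite(values)).any(axis=1).sum()),
}
print(report)
```

Note that an infinite RA is counted both as out of range and as non-finite, so the counts are per rule and not mutually exclusive.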


<details>
<summary>üìå Cell Description: Standardizing Numerical Features for Machine-Learning</summary>

This cell prepares the cleaned astronomical dataset for machine-learning by applying **standardization**, a common scaling technique used in data science. After cleaning invalid values, the dataset may still contain numerical columns with very different units or scales (for example, flux values vs. pixel positions vs. exposure times). Machine-learning models can become unstable or biased if these differences are not handled properly. Standardization transforms all selected numerical columns so that they share a similar scale, making the dataset mathematically well-behaved for algorithms that rely on distance, gradients, or statistical assumptions.

The cell selects the numerical columns and excludes any whose name contains a substring such as `id`, `name`, `time`, `date`, or `jd`. Because the match is on substrings, `exptime` and `obsjd` are excluded along with the ID columns, while `field` and `infobits` are still scaled. It then applies `StandardScaler`, which rescales each column to a mean of 0 and a standard deviation of 1. The scaled version of the dataset is stored in a new DataFrame called `df_standardized`, so the original values remain intact for reference or other types of analysis.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Identifies numerical columns suitable for scaling**, ignoring IDs, names, and timestamps because these fields do not represent measurable quantities.  
- **Avoids scaling inappropriate features**, such as filenames or observation dates, which would distort their meaning.  
- **Uses Standardization**:  
  - Converts each selected feature to have a **mean of 0** and **standard deviation of 1**.  
  - Helps algorithms treat all numerical features fairly.  
- **Creates a new DataFrame (`df_standardized`)** containing the standardized version of the dataset.  
- **Prevents model bias**, since unscaled features with large numeric ranges could overpower smaller-scaled features.  
- **Ensures mathematical stability** during machine-learning tasks, especially for distance-based or gradient-based models.  
- **Prints the columns being scaled**, offering transparency and reproducibility.

---

### ⭐ **Why This Cell Is Important for the Research**
Astronomical datasets often contain measurements with extremely different numerical ranges, for example flux values, exposure metadata, geometric positions, and detector readouts. Machine-learning models become more accurate, balanced, and reliable when all these values are brought to a common scale. Standardization is therefore a crucial preprocessing step before performing classification, clustering, anomaly detection, or dimensionality reduction. By preserving both the original and standardized versions of the dataset, this cell supports flexible experimentation while ensuring scientific rigor.

</details>


In [11]:
# 6) Scaling: Standardization (store standardized copy)
from sklearn.preprocessing import StandardScaler
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
# Substring patterns: 'id' also matches ccdid/qid/rcid/fid, 'time' matches exptime, 'jd' matches obsjd
exclude_patterns = ['id', 'Id', 'ID', 'name', 'filename', 'time', 'date', 'jd']
exclude_cols = [c for c in df.columns if any(p in c for p in exclude_patterns)]
cols_to_scale = [c for c in numeric_cols if c not in exclude_cols]
print('Columns to scale:', cols_to_scale)
scaler = StandardScaler()
if cols_to_scale:
    df_standardized = df.copy()
    df_standardized[cols_to_scale] = scaler.fit_transform(df[cols_to_scale])
    print('Standardization complete. Dataframe `df_standardized` is available.')
else:
    print('No numeric columns found to scale.')

Columns to scale: ['ra', 'dec', 'infobits', 'field', 'filefracday', 'seeing', 'airmass', 'moonillf', 'moonesb', 'maglimit', 'crpix1', 'crpix2', 'crval1', 'crval2', 'cd11', 'cd12', 'cd21', 'cd22', 'ra1', 'dec1', 'ra2', 'dec2', 'ra3', 'dec3', 'ra4', 'dec4']
Standardization complete. Dataframe `df_standardized` is available.
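
The substring patterns used to pick columns have side effects that are easy to miss: `time` matches `exptime`, while `id` does not match `field`. The plain-Python sketch below reproduces the matching on a subset of the column names and contrasts it with an explicit exclusion set. Which columns belong in that set is an assumption made here for illustration, not something the notebook defines.

```python
# A subset of the numeric column names from the dataset
numeric_cols = ["ra", "dec", "infobits", "field", "ccdid", "qid", "rcid", "fid",
                "pid", "nid", "expid", "itid", "obsjd", "exptime", "seeing", "airmass"]

# Patterns used by the scaling cell
exclude_patterns = ["id", "Id", "ID", "name", "filename", "time", "date", "jd"]

# Substring matching: a column is excluded if any pattern occurs in its name
excluded_by_substring = [c for c in numeric_cols
                         if any(p in c for p in exclude_patterns)]
print("Excluded by substring:", excluded_by_substring)

# Explicit alternative (hypothetical choice of identifier, flag, and time columns)
explicit_exclude = {"infobits", "field", "ccdid", "qid", "rcid", "fid",
                    "pid", "nid", "expid", "itid", "obsjd"}
cols_to_scale = [c for c in numeric_cols if c not in explicit_exclude]
print("Columns to scale:", cols_to_scale)
```

An explicit set is longer to write but makes the decision for each column visible and independent of naming accidents.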


<details>
<summary>üìå Cell Description: Feature Engineering to Create New Useful Scientific Features</summary>

This cell adds new derived features to the dataset to make the astronomical data more meaningful and informative for analysis and machine-learning. Feature engineering is the process of creating new variables from existing ones so that the model can better understand underlying scientific patterns. In astronomy, raw measurements alone are often not enough: derived quantities such as signal-to-noise ratio or logarithmic transformations reveal hidden trends that help models interpret faint objects, noisy measurements, and time-based behavior more effectively.

The cell first creates a **signal-to-noise ratio (SNR)** feature, which is critical in astronomy because it measures how strong a signal (flux) is relative to uncertainty (flux error). A higher SNR means a more reliable detection. Then, the cell searches for any possible time-related column (e.g., date, time, JD, MJD) and extracts the year, month, and day from it. This allows the dataset to capture temporal patterns, such as seasonal observation differences or instrument behavior over time.

Finally, for any numerical column that is strictly positive and strongly skewed, the cell creates a **logarithmic transformation** (`log1p_`), which reduces extreme values and makes distributions more balanced. Balanced distributions help machine-learning models learn more consistently. The cell prints messages for each new feature created, ensuring transparency.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Creates SNR feature (`snr = flux / flux_err`)**  
  - A fundamental astronomy metric indicating data quality and detection reliability.  
  - Higher SNR = stronger, clearer astronomical signal.

- **Automatically detects time/date columns**  
  - Searches for patterns like “date”, “time”, “JD”, “MJD”.  
  - Useful because not all datasets name their time columns consistently.

- **Extracts year, month, and day**  
  - Helps analyze trends across years, months, or nights of observation.  
  - Time-based features can improve classification and temporal modelling.

- **Applies logarithmic transformation to skewed positive features**  
  - Makes extremely large values more manageable.  
  - Reduces the effect of extreme outliers in heavily skewed astrophysical distributions.  
  - Adds new columns named `log1p_columnName`.

- **Ensures feature engineering is dynamic and adaptive**  
  - Only applies transformations when conditions are appropriate.  
  - Prevents accidental modification of unsuitable columns.

- **Improves model learning**  
  - Derived features help capture physical relationships that raw measurements alone cannot reveal.

---

### ⭐ **Why This Cell Is Important for the Research**
Astronomical measurements often contain noise, extreme values, and hidden patterns that are difficult to detect without transforming the raw data. By generating features such as SNR, time components, and logarithmic versions of skewed measurements, the dataset becomes far richer and more expressive. These engineered features help machine-learning models better interpret celestial behavior, improve prediction accuracy, and capture scientific relationships that are essential in astroinformatics. This step significantly enhances the dataset’s value and paves the way for more powerful and reliable modelling.

</details>


In [8]:
# 7) Feature engineering
import numpy as np

# Signal-to-noise ratio, created only if both flux columns are present
if set(['flux', 'flux_err']).issubset(df.columns):
    df['snr'] = df['flux'] / df['flux_err']
    print('Created `snr` feature')

# Candidate date columns, found by substring match on the column name
# (note: this also matches non-date columns such as `exptime`)
possible_date_cols = [c for c in df.columns if any(x in c.lower() for x in ['date','time','jd','mjd'])]
if possible_date_cols:
    date_col = possible_date_cols[0]  # only the first match is parsed
    try:
        df[date_col] = pd.to_datetime(df[date_col])
        df['year'] = df[date_col].dt.year
        df['month'] = df[date_col].dt.month
        df['day'] = df[date_col].dt.day
        print(f'Derived year/month/day from {date_col}')
    except Exception as e:
        print('Could not parse date column', date_col, ' - ', e)

# log1p for strictly positive numeric columns with skewness above 1
for c in df.select_dtypes(include=['number']).columns:
    if (df[c] > 0).all() and df[c].skew() > 1:
        df['log1p_' + c] = np.log1p(df[c])
        print('Created log1p_' + c)
print('Feature engineering complete.')

Could not parse date column obsdate  -  time data "2018-03-25 06:35:35+00" doesn't match format "%Y-%m-%d %H:%M:%S.%f%z", at position 35. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.
Created log1p_exptime
Created log1p_airmass
Feature engineering complete.
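
The output above shows that `obsdate` could not be parsed because some timestamps carry fractional seconds and others do not. One possible way around this, following the hint in the error message, is sketched below. It assumes pandas 2.0 or newer (where `format="ISO8601"` is available), and the two sample strings are hypothetical stand-ins for the two timestamp shapes, not rows copied from the dataset.

```python
import pandas as pd

# Hypothetical sample mimicking the two timestamp shapes in `obsdate`:
# one value with fractional seconds, one without.
obsdate = pd.Series([
    "2018-03-25 06:35:35.123+00",
    "2018-03-25 06:35:35+00",
])

# format="ISO8601" accepts ISO 8601 strings that differ in precision;
# utc=True returns a timezone-aware UTC datetime column.
parsed = pd.to_datetime(obsdate, format="ISO8601", utc=True)

# Derive the same calendar components the feature-engineering cell aims for.
parts = pd.DataFrame({
    "year": parsed.dt.year,
    "month": parsed.dt.month,
    "day": parsed.dt.day,
})
print(parts)
```

Applied to the real column, the same call would replace the bare `pd.to_datetime(df[date_col])` in the cell above.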


<details>
<summary>üìå Cell Description: Handling Class Imbalance Using SMOTE (If a Target Column Exists)</summary>

This cell checks whether the dataset contains a target or label column (such as ‚Äúclass‚Äù, ‚Äútype‚Äù, or ‚Äútarget‚Äù) and, if so, it attempts to correct **class imbalance** using SMOTE. Class imbalance happens when one category has many more samples than others. This is very common in astronomy‚Äîfor example, there may be thousands of normal observations but only a few rare events like supernovae or unusual transients. If this imbalance is not handled, machine-learning models tend to ignore the rare classes and perform poorly on the very objects scientists care most about.

This cell automatically detects a target column, prints the current class distribution, and then applies **SMOTE (Synthetic Minority Oversampling Technique)**. SMOTE creates new synthetic examples for the minority classes by interpolating between existing observations. This helps balance the dataset without simply duplicating rows. The cell uses only numerical features for SMOTE, generates a new balanced dataset, and prints the final shape. If SMOTE cannot run (e.g., missing library or no numerical columns), the cell safely skips resampling and prints a clear message.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Automatically detects a target column**, such as “label”, “class”, “type”, or “target”.  
- **Displays class counts before resampling**, giving a clear picture of how imbalanced the dataset is.  
- **Uses SMOTE**, a widely used technique that:  
  - Creates **synthetic samples** for minority classes.  
  - Reduces model bias toward the majority class.  
  - Can improve predictive performance on rare classes.  
- **Uses only numerical features** to generate synthetic samples, as SMOTE requires numeric inputs.  
- **Produces a new balanced dataset (`df_res`)** containing the numeric features and the target column, which can be used for machine-learning tasks.  
- **Handles errors safely**: if SMOTE is unavailable or unsuitable, the code prints an informative message.  
- **Keeps the research workflow flexible**, allowing the model to work with either the original or resampled dataset.

---

### ⭐ **Why This Cell Is Important for the Research**
Astronomical datasets often contain **rare events**, such as unusual transients or special categories of objects. These rare classes are usually the most scientifically interesting, but machine-learning models struggle to learn from them when the dataset is imbalanced. By applying SMOTE, the research gives minority categories equal representation during training, so the model is less likely to ignore rare events. Balancing the dataset supports stable and accurate predictions in astroinformatics applications. In the run recorded below, no target column was found, so resampling was skipped.

</details>


In [12]:
# 8) Handle class imbalance (try SMOTE if a target column exists)
from collections import Counter

# Look for a target column by exact (case-insensitive) name match
possible_targets = [c for c in df.columns if c.lower() in ['label','class','target','type']]
if possible_targets:
    target = possible_targets[0]
    print('Target column detected:', target)
    print('Class counts before:', Counter(df[target]))
    try:
        from imblearn.over_sampling import SMOTE
        # SMOTE works on numeric inputs, so keep only numeric feature columns
        features = df.drop(columns=[target]).select_dtypes(include=['number']).columns.tolist()
        if features:
            X = df[features]
            y = df[target]
            sm = SMOTE(random_state=42)
            X_res, y_res = sm.fit_resample(X, y)
            # df_res holds the numeric features plus the target only
            df_res = pd.DataFrame(X_res, columns=features)
            df_res[target] = y_res
            print('Resampled dataset shape:', df_res.shape)
        else:
            print('No numeric feature columns available for SMOTE; skipping resampling.')
    except Exception as e:
        print('SMOTE failed or imblearn not installed:', e)
else:
    print('No obvious target column found; skipping imbalance handling')

No obvious target column found; skipping imbalance handling
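
Because no target column was found, SMOTE never ran on this dataset. To illustrate the interpolation idea described in the cell description, here is a minimal NumPy sketch on toy data. It is not the `imblearn` implementation: the function name `smote_like_samples`, its parameters, and the toy minority points are assumptions made for the example.

```python
import numpy as np


def smote_like_samples(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours (the core idea behind SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Toy minority class (hypothetical 2-D feature values).
X_minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]])
synthetic = smote_like_samples(X_minority, n_new=6)
print(synthetic.shape)
print(synthetic)
```

Each synthetic point lies on the line segment between two existing minority points, which is why the method adds variety without simply duplicating rows.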


<details>
<summary>üìå Cell Description: Final Verification and Saving the Cleaned Datasets</summary>

This cell performs the final confirmation steps in the cleaning pipeline and saves the fully processed datasets to disk. After all earlier stages‚Äîsuch as removing missing values, fixing invalid entries, handling outliers, engineering new features, and balancing classes‚Äîthe dataset is now clean, consistent, and ready for analysis or machine-learning. The cell prints the final shape of the dataset, displays the data types, and then exports the cleaned DataFrame into a new CSV file for future use. If a standardized version of the dataset was created earlier, this cell also saves that version separately. Finally, it previews the first few rows so the researcher can visually confirm that the dataset looks correct.

This saving process is essential because it freezes the cleaned dataset in a stable and reproducible format. Future analysis, modelling, visualizations, or external collaborators can now use the same prepared dataset without rerunning the entire cleaning pipeline.

---

### 🔹 **Key Points (Simple & Attractive Explanation)**

- **Verifies the final shape of the dataset**, ensuring no unexpected row or column changes occurred during cleaning.  
- **Displays the final data types**, confirming that numerical and categorical columns were processed correctly.  
- **Saves the cleaned dataset** as `ztf_image_search_results_full_cleaned.csv` for easy reuse.  
- **Optionally saves the standardized dataset**, if it was created earlier, ensuring both raw-cleaned and scaled versions are available.  
- **Provides a quick preview (`df.head()`)** so the researcher can visually inspect the final output.  
- **Ensures reproducibility**, because future experiments can load the exact same dataset without repeating all cleaning steps.  
- **Supports collaboration**, as other researchers or viva examiners can view the cleaned dataset directly.  
- **Marks the completion of the full data-preprocessing pipeline**, allowing the research to move on to modelling, analysis, or visualization.

---

### ⭐ **Why This Cell Is Important for the Research**
Saving the cleaned and standardized datasets makes the entire preprocessing workflow transparent and repeatable, two core requirements of scientific research. Machine-learning experiments depend on consistent input data. Without saving the processed dataset, even small variations in cleaning steps could produce different results. This cell gives the research a stable foundation, so that every analysis step that follows can be verified and reproduced from the same input. It also provides a clean dataset that can be shared with supervisors, collaborators, or the viva evaluation panel.

</details>


In [10]:
# 9) Final checks and save cleaned data
print('Final shape:', df.shape)
print(df.dtypes)
output_path = 'ztf_image_search_results_full_cleaned.csv'
df.to_csv(output_path, index=False)
print('Cleaned dataset saved to', output_path)
try:
    df_standardized.to_csv('ztf_image_search_results_full_standardized.csv', index=False)
    print('Standardized dataset saved to ztf_image_search_results_full_standardized.csv')
except NameError:
    pass
df.head()

Final shape: (62368, 44)
ra               float64
dec              float64
infobits           int64
field              int64
ccdid              int64
qid                int64
rcid               int64
fid                int64
filtercode        object
pid                int64
nid                int64
expid              int64
itid               int64
imgtype           object
imgtypecode       object
obsdate           object
obsjd            float64
exptime            int64
filefracday        int64
seeing           float64
airmass          float64
moonillf         float64
moonesb            int64
maglimit         float64
crpix1           float64
crpix2           float64
crval1           float64
crval2           float64
cd11             float64
cd12             float64
cd21             float64
cd22             float64
ra1              float64
dec1             float64
ra2              float64
dec2             float64
ra3              float64
dec3             float64
ra4              float64


Unnamed: 0,ra,dec,infobits,field,ccdid,qid,rcid,fid,filtercode,pid,...,ra2,dec2,ra3,dec3,ra4,dec4,ipac_pub_date,ipac_gid,log1p_exptime,log1p_airmass
0,142.513076,22.239831,67108864,570,16,1,60,2,zr,769412526015,...,142.036715,22.664666,142.055997,21.798581,142.986538,21.813558,2020-12-09 00:00:00+00,2,3.433987,0.75283
1,141.637575,21.360661,0,570,16,3,62,2,zr,1503405246215,...,141.161603,21.782992,141.186132,20.917351,142.110832,20.937013,2022-09-07 00:00:00+00,2,3.433987,0.755183
2,141.618681,22.225694,0,570,16,2,61,2,zr,1504338726115,...,141.139773,22.647755,141.164682,21.781999,142.094901,21.802092,2022-09-07 00:00:00+00,2,3.433987,0.710004
3,141.637182,19.443859,0,570,12,3,46,2,zr,1521276364615,...,141.167231,19.866253,141.191126,19.000577,142.104704,19.020278,2022-09-07 00:00:00+00,3,3.433987,0.709513
4,141.637223,19.440343,0,570,12,3,46,1,zg,1519302464615,...,141.167275,19.862742,141.191122,18.996967,142.104734,19.01676,2021-06-30 00:00:00+00,1,3.433987,0.710496
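
To confirm that a saved file can be reloaded unchanged, a quick round-trip check can follow the save step. The sketch below is an assumption-laden illustration: it uses a small hypothetical DataFrame and a temporary file rather than the real `ztf_image_search_results_full_cleaned.csv`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the cleaned DataFrame.
cleaned = pd.DataFrame({
    "field": [570, 570],
    "filtercode": ["zr", "zg"],
    "airmass": [1.123, 1.034],
})

# Write to a temporary CSV and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_example.csv")
    cleaned.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Compare shape and column order between the original and reloaded frames.
same_shape = reloaded.shape == cleaned.shape
same_columns = list(reloaded.columns) == list(cleaned.columns)
print("Same shape:", same_shape)
print("Same columns:", same_columns)
```

CSV does not store data types, so columns parsed as datetimes would come back as plain strings and need to be parsed again after loading.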
