
a few tweaks to the readmes to clarify and added requirements file #49

Merged
merged 37 commits into from
Jan 16, 2024
Changes from all commits
Commits (37)
e762e9f
add readme
Oct 6, 2023
53ddf63
add project outline
Oct 6, 2023
5d7f685
fix authors
Oct 6, 2023
07200a8
add team agreement
Oct 6, 2023
edfa7dc
Merge pull request #39 from BU-Spark/project_outline
mahaveer220 Oct 10, 2023
b6eeb8a
project research
Oct 16, 2023
bdc258c
Update research.md
notgitika Oct 24, 2023
6c5b851
nlp analysis
navyajain4 Oct 28, 2023
1b13aaa
Revert "nlp analysis"
Oct 28, 2023
b823d90
dataexplore
Oct 28, 2023
844f6e2
eda_nlp
navyajain4 Oct 28, 2023
926365a
add sample data with naming convenitons
Oct 28, 2023
d5b203f
Merge pull request #40 from BU-Spark/project_research
mahaveer220 Oct 28, 2023
26abce9
Merge pull request #41 from BU-Spark/eda
rithvik-doshi Nov 6, 2023
8a65d63
results visualizer
Nov 25, 2023
fcc7714
Merge pull request #44 from BU-Spark/deployment
saivarshith06 Nov 29, 2023
d661885
display images in a grid
Dec 9, 2023
a9399c0
change the slider to select_slider, add metric and model info
Dec 9, 2023
af8a14c
add code explaining comments
Dec 12, 2023
08a0c61
add sample view and readme file
Dec 12, 2023
6d89f66
Merge pull request #45 from BU-Spark/deployment_2
saivarshith06 Dec 13, 2023
746fd22
add flow documentation
Dec 15, 2023
f94e9e3
add table
Dec 15, 2023
a31db17
commit final code with documentation
Dec 15, 2023
9051e19
Revert "commit final code with documentation"
Dec 15, 2023
6f3d01e
remove unwanted files
Dec 15, 2023
40bfabe
add face_detector_results
Dec 15, 2023
eebc07a
clean code, add documentation, final push
Dec 15, 2023
8ad1820
Merge pull request #46 from BU-Spark/dev_mini
saivarshith06 Dec 15, 2023
11d7155
add env recreation instructions
Dec 20, 2023
d3e1412
Merge pull request #47 from BU-Spark/envcreate
kabilanmohanraj Dec 20, 2023
a5568d8
Update README.md
saivarshith06 Dec 20, 2023
7540d7c
Update README.md
saivarshith06 Dec 20, 2023
f806e5d
Update README.md
saivarshith06 Dec 20, 2023
9c26db8
Update README.md
saivarshith06 Dec 20, 2023
20bfaa1
Update README.md
saivarshith06 Dec 20, 2023
ebaddb5
a few small edits to clarify readmes
trgardos Dec 21, 2023
2 changes: 2 additions & 0 deletions README.md
@@ -1,5 +1,7 @@
# Documentation

This is the README for the Spring 2023 course project. For Fall 2023,
see this [README](./fall23/README.md).

## Roadmap of Repository Directories
- [`DeploymentFinalNBs`](https://github.com/BU-Spark/ml-atfal-mafkoda-missing-children/tree/dev/Deployment:FinalNBs) - contains the deployment notebooks for the face aging demo as well as the facial recognition demo
194 changes: 194 additions & 0 deletions fall23/README.md
@@ -0,0 +1,194 @@

# Flow Documentation
This document provides an overview of the key components in our ML Pipeline. Please refer to each file for an in-depth explanation.

## Key Components

**JSON Data File**
- Description: A JSON file containing all relevant Facebook data; it serves as the primary dataset. An example JSON file is in the `inputs_outputs` folder.

**Data Downloader**
- Function: Downloads all images associated with a specific `CaseId`.
- Output: Images are stored in a structured directory named `case_status/case_id`.

**Data Organizer**
- Function: Aggregates enhanced images from each `CaseId`.
- Organization Method: Images are sorted into folders based on their `case_status`.

**Face Detector**
- Function: Identifies images in which no face can be detected and removes them from the processing queue.
- Output: Non-detectable faces are listed in a CSV file.

**Face Matcher**
- Function: Compares each individual's face against a directory of 'John Does'.
- Output: Generates a CSV file listing individuals who resemble 'John Does', along with their respective similarity scores.


<img src="flow_chart.png" alt="Flow Chart" width="750"/>

## data_downloader.ipynb

This notebook automates the downloading and organizing of images based on JSON Facebook posts data. Key features include:
- **Downloading Images**: It downloads images from URLs found in a JSON file and saves them in specified directories.
- **Organizing Data**: Images are organized by case status and IDs into separate folders.
- **Error Logging**: Missing data from the JSON file is logged into a CSV file for review.
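
For illustration, below is a minimal sketch of the download step. The field names (`CaseId`, `case_status`, `image_urls`) and the file-naming scheme are assumptions made for this example, not the notebook's exact schema; the real notebook also logs missing JSON fields to a CSV file.

```python
import json
from pathlib import Path

import requests
from tqdm import tqdm

def download_images(json_path: str, out_root: str = "downloads") -> list:
    """Download every image referenced in the JSON export into case_status/case_id folders."""
    with open(json_path, encoding="utf-8") as f:
        posts = json.load(f)

    failures = []
    for post in tqdm(posts, desc="downloading"):
        case_id = post.get("CaseId")                 # illustrative field name
        case_status = post.get("case_status", "unknown")
        urls = post.get("image_urls", [])            # illustrative field name
        if not case_id or not urls:
            failures.append(post)                    # the notebook logs these to a CSV
            continue

        # Images land in case_status/case_id/, mirroring the documented layout.
        target_dir = Path(out_root) / case_status / str(case_id)
        target_dir.mkdir(parents=True, exist_ok=True)

        for i, url in enumerate(urls):
            resp = requests.get(url, timeout=30)
            if resp.ok:
                (target_dir / f"{case_id}_{i}.jpg").write_bytes(resp.content)
            else:
                failures.append({"CaseId": case_id, "url": url})
    return failures
```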


## data_explore.ipynb
The `data_explore` directory includes the `data_explore.ipynb` notebook, which specializes in analyzing and visualizing the given Facebook posts data. Key components and outputs of the notebook are outlined below:

### Data Conversion and Preprocessing
- The notebook begins by converting data from a JSON file to a CSV format using the `JsonToCsv` class, facilitating easier data manipulation and analysis with pandas.
- Essential preprocessing steps include cleaning text data and extracting relevant information like case IDs, case status, and other pertinent details.
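
As a rough illustration of this conversion step (the notebook's actual `JsonToCsv` class may extract additional fields and apply project-specific text cleaning), the core idea can be sketched with pandas:

```python
import json

import pandas as pd

def json_to_csv(json_path: str, csv_path: str = "jsontocsv.csv") -> pd.DataFrame:
    """Flatten the JSON export of Facebook posts into a CSV for analysis."""
    with open(json_path, encoding="utf-8") as f:
        posts = json.load(f)
    df = pd.json_normalize(posts)   # flatten nested post records into columns
    df.to_csv(csv_path, index=False)
    return df
```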

### Data Analysis and Visualization
- **Case Status Analysis**: The notebook groups data by case status and counts unique case IDs. A bar graph is then generated to visualize the distribution of cases across different statuses, such as 'Missing', 'Reunited', and 'John Doe'.
- **Age Extraction and Analysis**: It includes a function to extract age from text data, particularly focusing on Arabic text. The notebook then analyzes the age distribution for different case statuses, providing insights into the age profiles of missing or unidentified individuals. Line graphs are plotted to show age frequencies, highlighting mean, median, and mode ages.
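
A simplified sketch of the age-extraction idea is shown below; the notebook's actual patterns and keyword list may differ, and the keywords and age bounds here are illustrative.

```python
import re
from typing import Optional

# Map Arabic-Indic digits to ASCII so one numeric regex handles both scripts.
ARABIC_DIGITS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")

def extract_age(text: str) -> Optional[int]:
    """Return the first plausible age mentioned in a post's text, if any."""
    normalized = text.translate(ARABIC_DIGITS)
    # Look for a 1-2 digit number followed by a year keyword (Arabic or English).
    match = re.search(r"(\d{1,2})\s*(?:سنة|عام|years?|yrs?)", normalized)
    if match:
        age = int(match.group(1))
        if 0 < age < 100:   # basic sanity check on the extracted value
            return age
    return None
```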

### Enhanced Image Analysis
- The code navigates through a structured directory containing image data, counting the number of enhanced images in specific subfolders. This part of the notebook is crucial for understanding the volume and distribution of enhanced images across different case IDs and post indexes.
- For each post index and case status, bar graphs are generated to display the frequency of case IDs with varying numbers of enhanced images. These visualizations are instrumental in identifying patterns or anomalies in the image enhancement process associated with the cases.
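
A minimal sketch of this counting step is shown below, assuming the `case_status/case_id/.../enhanced/` layout described earlier; the directory names and file extension are illustrative.

```python
from collections import Counter
from pathlib import Path

import matplotlib.pyplot as plt

def enhanced_image_counts(root: str, case_status: str) -> Counter:
    """Count enhanced images per case ID under root/case_status/."""
    counts = Counter()
    for case_dir in (Path(root) / case_status).iterdir():
        if case_dir.is_dir():
            counts[case_dir.name] = len(list(case_dir.rglob("enhanced/*.jpg")))
    return counts

# Frequency of case IDs by number of enhanced images, as in the notebook's bar graphs.
counts = enhanced_image_counts("downloads", "missing")   # illustrative paths
freq = Counter(counts.values())
plt.bar(list(freq.keys()), list(freq.values()))
plt.xlabel("Number of enhanced images")
plt.ylabel("Number of case IDs")
plt.show()
```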


## data_organizer.ipynb

- **Input Directory**: Accepts the directory of image data produced by `data_downloader.ipynb`, along with an output directory for the organized images.
- **Selective Extraction**: The script processes only specified folders within each image set, such as 'enhanced' images, based on user input.

### Output Structure
- **Root Directory**: The output is stored in a specified folder.
- **Organized Directory Structure**: Within this root directory, the images are systematically categorized:
- **Case Status Subdirectories**: Images are sorted into subdirectories named after each case status, such as 'Missing' and 'John Doe'.
- **Image Type Folders**: Within each case status folder, images are further organized into folders based on their type (e.g., 'enhanced').
- **Image Files**: These folders contain the relevant images, which are consistently named and sorted, facilitating easy location and access to specific images or sets of images.

This organized structure greatly simplifies the management and access of image categories, streamlining the process for further analysis or use in subsequent pipeline steps, such as `face_detector.ipynb`.
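
A rough sketch of this reorganization step is shown below, assuming the downloader's `case_status/case_id/enhanced/` nesting and JPEG files; folder names and the naming convention are illustrative.

```python
import shutil
from pathlib import Path

def organize_images(input_root: str, output_root: str, image_type: str = "enhanced") -> None:
    """Copy one image type from case_status/case_id/... into case_status/image_type/."""
    for case_status_dir in Path(input_root).iterdir():
        if not case_status_dir.is_dir():
            continue
        target_dir = Path(output_root) / case_status_dir.name / image_type
        target_dir.mkdir(parents=True, exist_ok=True)
        for image_path in case_status_dir.rglob(f"{image_type}/*.jpg"):
            # Prefix with the case ID (assumed to be the folder directly under case_status)
            # so copied files stay uniquely and consistently named.
            case_id = image_path.relative_to(case_status_dir).parts[0]
            shutil.copy2(image_path, target_dir / f"{case_id}_{image_path.name}")
```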


## face_detector_results

This directory contains outcomes from the `face_detector.ipynb` notebook, which performs facial detection, a crucial step before proceeding to `face_matching.ipynb`.
Key elements and functionalities:

### `FaceDetector` Class Overview
- Implements face detection on a set of images using the `DeepFace` library, capable of handling images where no faces are detected.

### Processing Workflow
- **Backend Selection**: Enables the selection of detection algorithms such as 'opencv', 'mtcnn', 'ssd', 'dlib', 'retinaface', 'mediapipe', 'yolov8', 'yunet', and 'fastmtcnn'.
- **Face Detection Execution**: Processes images iteratively for face detection, using algorithms specified by the user.
- **Handling Non-Detected Images**: Depending on user preferences, images without detected faces are either deleted or moved to a designated folder.
- **Output Creation**: Produces a CSV file that details the filenames of images where faces were not detected, facilitating further analysis.
- **Graphical Representation**: Exhibits a bar chart to visually compare the count of non-detectable images across combinations like 'johndoe-mtcnn' and 'missing-opencv'.

- `non_detect_johndoe_mtcnn.csv` and `non_detect_missing_mtcnn.csv`: CSV files listing the filenames of images where faces were not detected when using the mtcnn detector backend.

- `non_detect_johndoe_opencv.csv` and `non_detect_missing_opencv.csv`: CSV files listing the filenames of images where faces were not detected when using the opencv detector backend.

- `non_detected_faces_missing` and `non_detected_faces_johndoe`: Contain images from the 'missing' and 'John Doe' categories where faces were not detected using mtcnn.

- We found that MTCNN drastically reduced the number of non-detectable images compared to OpenCV, so we continued to use MTCNN in our pipeline.
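
The processing workflow above can be sketched roughly as follows. This is not the notebook's exact `FaceDetector` class: it assumes a recent `deepface` release, in which `DeepFace.extract_faces` with `enforce_detection=True` raises a `ValueError` when no face is found (older releases expose `detectFace` instead), and the paths are illustrative.

```python
import csv
from pathlib import Path

from deepface import DeepFace

def find_non_detectable(image_dir: str, backend: str = "mtcnn",
                        out_csv: str = "non_detect.csv") -> list:
    """Record the filenames of images in which the chosen backend finds no face."""
    non_detected = []
    for image_path in Path(image_dir).glob("*.jpg"):
        try:
            DeepFace.extract_faces(img_path=str(image_path),
                                   detector_backend=backend,
                                   enforce_detection=True)
        except ValueError:
            # enforce_detection=True raises when no face is found in the image.
            non_detected.append(image_path.name)

    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename"])
        writer.writerows([[name] for name in non_detected])
    return non_detected
```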


## inputs_outputs

This directory contains essential files used in and generated by the project's notebooks: a JSON file (`response_1699422658272.json`) with Facebook posts data, which serves as input for `data_downloader.ipynb`; the `jsontocsv.csv` file generated by `data_explore.ipynb`; and two output CSV files, `matches_cosine.csv` and `matches_euclideanl2.csv`, produced using different metrics in `face_matching.ipynb`.

### Summary of `face_matching.ipynb`
The `face_matching.ipynb` notebook plays a vital role in performing facial matching between two sets of images, utilizing a range of face recognition models and metrics. Its key features are as follows:

#### `FaceRecognition` Class
- **Purpose**: Matches faces between two datasets – one containing images of missing persons and another of unidentified individuals (John Doe).
- **Functionality**: Employs the `DeepFace` library for advanced face recognition, leading to the creation of detailed CSV files with match results.
- In our pipeline, the MTCNN detector backend is used for face detection, and the VGG-Face model is employed for face recognition.

#### Key Functionalities
- **Model Selection**: Supports various face recognition models, including "VGG-Face", "Facenet", "Facenet512", "OpenFace", "DeepFace", "DeepID", "ArcFace", "Dlib", and "SFace".
- **Metric Options**: Provides the choice of multiple metrics for comparison, such as "cosine", "euclidean", and "euclidean_l2".
- **Face Matching Process**: Processes images iteratively from specified directories, applying face detection and matching techniques. The process can be customized using the different metrics and models.
- **Output Generation**: Generates CSV files listing the matches found, potentially identifying unknown individuals based on the selected model and metric.
- **Additional Output**: A pickle file representing image embeddings is also generated, enhancing the data analysis capabilities.

These features make the `face_matching.ipynb` notebook a comprehensive tool for facial recognition tasks, offering versatility and precision in identifying unknown individuals in various scenarios.
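
A hedged sketch of the matching loop is shown below. It is not the notebook's exact `FaceRecognition` class: it uses `DeepFace.find`, whose return type and distance column name vary across `deepface` versions, and the output columns follow the CSV structure described in the note that follows.

```python
from pathlib import Path

import pandas as pd
from deepface import DeepFace

def match_missing_against_johndoes(missing_dir: str, johndoe_dir: str,
                                   model: str = "VGG-Face",
                                   metric: str = "cosine") -> pd.DataFrame:
    """Match each missing-person image against the John Doe directory."""
    rows = []
    for image_path in Path(missing_dir).glob("*.jpg"):
        results = DeepFace.find(img_path=str(image_path),
                                db_path=johndoe_dir,
                                model_name=model,
                                distance_metric=metric,
                                detector_backend="mtcnn",
                                enforce_detection=False)
        # Recent deepface versions return a list of DataFrames (one per detected face);
        # older versions return a single DataFrame.
        if isinstance(results, list):
            results = results[0] if results else pd.DataFrame()
        for _, match in results.iterrows():
            rows.append({
                "missing_filename": image_path.name,
                "unknowns_matched_filenames": Path(match["identity"]).name,
                # Older versions name this column e.g. "VGG-Face_cosine" instead of "distance".
                "distance": match.get("distance"),
            })
    return pd.DataFrame(rows)
```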

**Note on CSV File Structure**

- **`missing_filename` Column**: This column contains the filenames of the images for which matches are being sought. These images typically represent missing individuals or subjects of interest.

- **`unknowns_matched_filenames` Column**: This column lists the filenames drawn from the pool of candidate images of interest (for example, John Doe individuals). These are the images identified as potential matches for the filenames listed in the `missing_filename` column.


## deployment
This directory includes all the Streamlit deployment files.


## experiment_results_analysis

This directory contains all the essential files and notebooks related to the results and analysis of the experiment. Below is an overview of each file:

- **exp_result_analysis_cosine.ipynb** and **exp_result_analysis_euclidean_l2.ipynb**: Jupyter notebooks dedicated to analyzing the face matching results obtained with the Cosine and Euclidean L2 metrics, respectively. They also include data processing, visualization, and statistical analysis tailored to each specific metric.

- **exp_matches_cosine.csv** and **exp_matches_euclideanl2.csv**: These files consist of all face matches calculated in `face_matching.ipynb` using the Cosine and Euclidean L2 metrics, respectively. They provide foundational datasets for understanding the nuances of facial recognition using these metrics.

- **aggregated_data_cosine.csv** and **aggregated_data_euclideanl2.csv**: Generated by `exp_result_analysis_cosine.ipynb` and `exp_result_analysis_euclidean_l2.ipynb` respectively, these files result from the data aggregation process.

**Note on `exp_result_analysis_cosine.ipynb` and `exp_result_analysis_euclidean_l2.ipynb` files**

- The term ‘Relevant matches’ refers to the faces that are correctly matched in the notebooks `exp_result_analysis_cosine.ipynb` and `exp_result_analysis_euclidean_l2.ipynb`. These matches are considered accurate or 'relevant' based on the specific metrics (Cosine or Euclidean L2) used in these notebooks.


## Experiment Overview

### Case Selection
The experiment begins by selecting case IDs that have at least two images of missing individuals. This ensures a sufficient dataset for effective comparison and analysis.

### Face Detection and Matching
- Images are initially processed through `face_detector.ipynb`, employing the MTCNN algorithm for face detection.
- Detected faces are matched using `face_matching.ipynb`. This notebook uses DeepFace with 'VGG-Face' as the model for recognition and 'mtcnn' as the detector backend. Matching is performed using two metrics: Euclidean L2 and Cosine.

### Generation of Match Results
The matching process yields two CSV files: `exp_matches_euclideanl2.csv` and `exp_matches_cosine.csv`. These files document the details of matched faces based on the respective metrics.
These CSV files are then fed into `exp_result_analysis_euclidean_l2.ipynb` and `exp_result_analysis_cosine.ipynb` for further analysis.

## Data Analysis Process

### Preprocessing Stage
- **Removing Self-Matches**: Self-matches are identified and eliminated from the match results. This is crucial since matching missing individuals' images against their own set often results in identical matches, skewing the analysis.
- **Eliminating Duplicates**: The dataset is further cleaned by removing duplicate entries, ensuring uniqueness and validity of each data point.
- **Extracting Key Information**: Essential information like case IDs is extracted, which is crucial for individual case tracking and analysis.

### Aggregation and Data Refinement
- **Grouping by Case ID**: The data is organized by case IDs, collating all relevant information for each case. This facilitates more focused analysis on a case-by-case basis.
- **Resulting Data Files**: The aggregation process results in two key files: `aggregated_data_cosine.csv` and `aggregated_data_euclideanl2.csv`.

### Analysis of Match Counts
- **Counting Correct Matches**: The analysis includes counting the number of correct matches for each case ID from the pool of matches obtained through the face matching process.
- **Frequency and Distribution Insights**: The match counts are analyzed to understand the frequency of matches per case and their distribution across the dataset.
- **Precision of Matches**: This part of the analysis evaluates the precision of matches by examining score ranges and the position of the first match in the top results, crucial for assessing the accuracy of early matches.
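
As an illustration of the preprocessing and aggregation described above, a pandas sketch is given below. The case-ID extraction assumes that filenames begin with the case ID, which is an illustrative convention rather than the project's confirmed naming scheme.

```python
import pandas as pd

def aggregate_matches(matches_csv: str) -> pd.DataFrame:
    """Clean the raw match results and aggregate them per case ID."""
    df = pd.read_csv(matches_csv)

    # Remove self-matches (an image matched against itself) and duplicate rows.
    df = df[df["missing_filename"] != df["unknowns_matched_filenames"]].drop_duplicates()

    # Extract case IDs from both filename columns (assumed "caseid_*.jpg" naming).
    df["case_id"] = df["missing_filename"].str.split("_").str[0]
    df["matched_case_id"] = df["unknowns_matched_filenames"].str.split("_").str[0]

    # A match is counted as correct when it points to another image of the same case.
    df["correct"] = df["case_id"] == df["matched_case_id"]

    # Group by case ID, collating pool size and correct-match count per case.
    return (df.groupby("case_id")
              .agg(pool_size=("matched_case_id", "size"),
                   correct_matches=("correct", "sum"))
              .reset_index())
```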

## Result Analysis via Notebooks
Two notebooks, `exp_result_analysis_euclidean_l2.ipynb` and `exp_result_analysis_cosine.ipynb`, analyze the results based on the Euclidean L2 and Cosine metrics, respectively. While they follow parallel methodologies, each notebook focuses on a distinct metric.

## Euclidean L2 vs. Cosine

| Criteria | Euclidean_l2 | Cosine |
|---|---|---|
| Total Case IDs Evaluated | 249 | 249 |
| Case IDs with No Matches | 26 | 18 |
| Case IDs with At Least One Match | 223 | 231 |
| Avg Pool Size | Small (221) | Large (336) |
| Probability of Finding Exact Match | Good | Slightly Higher |
| Matches in Top 50 Pool | 85.7% (191/223) | 86.6% (200/231) |


## Install required packages
- Create a virtual environment and pip install the following modules:
```
pip install requests
pip install Pillow
pip install tqdm
pip install pandas
pip install seaborn
pip install matplotlib
pip install deepface
```
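Equivalently, the same dependencies can be captured in a `requirements.txt` (versions left unpinned here; pin as needed):
```
requests
Pillow
tqdm
pandas
seaborn
matplotlib
deepface
```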
Binary file added fall23/Team_Agreement.pdf
Binary file not shown.