LipReading-Webapp

For details, refer to our paper: https://www.preprints.org/manuscript/202312.0928/v1

We have developed a web application that generates subtitles for any uploaded video in which the speaker's mouth is visible. The model combines a 3D convolutional network with a bidirectional LSTM and makes sentence-level predictions based only on the visual movements of the lips. The lip region is segmented from each video and saved as an animated GIF, and these GIFs were used as the preprocessed data to train our model.

Model

The mouth region is segmented statically, and the cropped frames are written with imageio's mimsave as an animated GIF that our model learns to decode.

Fig-1 : The segmented region
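The segmentation step can be sketched as follows. This is a minimal illustration and not the project's actual preprocessing code: the helper names, the crop coordinates, and the assumption that frames are grayscale arrays with values in 0-255 are all hypothetical. Only the use of imageio's mimsave comes from the description above.

```python
import numpy as np


def crop_mouth(frames, top, bottom, left, right):
    """Apply the same fixed (static) crop to every frame.

    frames: array of shape (num_frames, height, width).
    The crop bounds are hypothetical parameters; pick them for your videos.
    """
    return frames[:, top:bottom, left:right]


def save_gif(frames, path):
    """Write the cropped frames as an animated GIF with imageio.mimsave."""
    # Imported here so crop_mouth can be used without imageio installed.
    import imageio

    # Assumption: pixel values are already in the 0-255 range.
    images = [np.asarray(frame, dtype=np.uint8) for frame in frames]
    imageio.mimsave(path, images)


if __name__ == "__main__":
    # Synthetic stand-in for a decoded video: 5 grayscale frames of 100x200.
    video = np.random.randint(0, 256, size=(5, 100, 200), dtype=np.uint8)
    mouth = crop_mouth(video, top=40, bottom=86, left=30, right=170)
    save_gif(mouth, "mouth_region.gif")
```

A static crop is simple and fast, but it relies on the mouth staying in roughly the same place in every frame.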

To train our model, we used a combined CNN and LSTM architecture, which is a powerful approach for processing sequential data with both spatial and temporal dependencies. A 3D convolutional network is well suited to video because it convolves over time as well as over space. Like 2D (spatial) convolution, 3D convolution moves a kernel (also called a filter) across the input data to extract local features. The kernel has a depth, a height, and a width, and it slides over the input volume along all three axes. At each position, the kernel is multiplied element-wise with the corresponding input sub-volume and the products are summed (a dot product). Repeating this at every position produces a feature map with higher-level representations of the input volume.

The output of the CNN is then fed into the LSTM as sequential data, where the LSTM captures temporal dependencies and patterns. The architecture also includes dense, dropout, and bidirectional layers; the bidirectional wrapper lets the LSTM process the temporal component in both directions. The Adam optimizer was used for training, and the final model was deployed as a web application using Streamlit.
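To make the sliding-kernel description of 3D convolution concrete, here is a minimal pure-Python sketch. It is for illustration only and is not the model code: the function name and the toy input are hypothetical, there is a single channel, no padding, and a stride of 1. As in most deep learning libraries, the kernel is not flipped, so this is strictly a cross-correlation.

```python
def conv3d_valid(volume, kernel):
    """Slide a 3D kernel over a 3D volume (no padding, stride 1).

    Both arguments are nested lists indexed as [depth][height][width].
    Returns the feature map as nested lists in the same layout.
    """
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    kd, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])

    feature_map = []
    for d in range(depth - kd + 1):
        plane = []
        for h in range(height - kh + 1):
            row = []
            for w in range(width - kw + 1):
                # Dot product of the kernel with the sub-volume at (d, h, w).
                total = 0.0
                for i in range(kd):
                    for j in range(kh):
                        for k in range(kw):
                            total += volume[d + i][h + j][w + k] * kernel[i][j][k]
                row.append(total)
            plane.append(row)
        feature_map.append(plane)
    return feature_map


# Toy "video": 3 frames of 4x4 pixels, where every pixel in frame t equals t.
volume = [[[float(t)] * 4 for _ in range(4)] for t in range(3)]
# 2x2x2 kernel of ones: sums each 2-frame, 2x2-pixel neighbourhood.
kernel = [[[1.0] * 2 for _ in range(2)] for _ in range(2)]

feature_map = conv3d_valid(volume, kernel)
# Output shape is (3-2+1, 4-2+1, 4-2+1) = (2, 3, 3).
print(len(feature_map), len(feature_map[0]), len(feature_map[0][0]))
```

Because the kernel spans two frames, each output value mixes information from neighbouring points in time, which is what makes 3D convolution a natural fit for video input.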

Results

We propose LipReader, a web application where a user can upload any video in which mouth movements are visible in the frame and get the predicted end-to-end, sentence-level text, which can be used as subtitles for the video. The accuracy of our model is 97%. The predicted text can save time for subtitle creators and help people with hearing impairments. People who cannot speak can also record their videos, get the predicted text, and use a text-to-speech converter to have their own voice instead of using sign language. Refer to the PPT in this repository for a better understanding.

