{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":280932109,"defaultBranch":"master","name":"annotated_research_papers","ownerLogin":"AakashKumarNain","currentUserCanPush":false,"isFork":false,"isEmpty":false,"createdAt":"2020-07-19T19:05:19.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/11736571?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1623871639.6092129","currentOid":""},"activityList":{"items":[{"before":"d43e955fa6abf607a20dbe460f1e75513b62435e","after":"761530531889f4e1a5838a1bf0b1dccbf8e5077d","ref":"refs/heads/master","pushedAt":"2024-05-21T14:19:28.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add Visual fact checkr paper","shortMessageHtmlLink":"add Visual fact checkr paper"}},{"before":"5417d2028fed6f44360de119b0b073a757601d62","after":"d43e955fa6abf607a20dbe460f1e75513b62435e","ref":"refs/heads/master","pushedAt":"2024-05-21T14:11:18.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"1. Paper from Nvidia focusing on hallucination problems in the existing VLMs.\n2. Overly concise captions, hallucinated details about the objects in the scene, incorrect object counting, etc., are a few common problems observed in captioning-based VLMs.\n3. To address the hallucinations, the authors propose a **training-free** pipeline that utilizes visual grounding tools for improved accuracy and offers higher fidelity captions for both 2D and 3D\n4. Though the pipeline is more verbose for 3D objects, both 2D and 3D workflows share a portion that includes: Proposal, Verification, and Captioning steps.\n5. The authors also propose a new metric: CLIP-Image score. CLIP score alone is not sufficient to verify if the final captions are good enough. CLIP-Image-Score offers a sensitive measure for detecting hallucinations. ","shortMessageHtmlLink":"1. Paper from Nvidia focusing on hallucination problems in the existi…"}},{"before":"f0f1c80adc073661f353207c9e6ed51003cdb9a3","after":"5417d2028fed6f44360de119b0b073a757601d62","ref":"refs/heads/master","pushedAt":"2024-04-24T12:00:14.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add ferret_v2 paper","shortMessageHtmlLink":"add ferret_v2 paper"}},{"before":"dd48102e80594ac50a64c0e123c4dbbafbb147eb","after":"f0f1c80adc073661f353207c9e6ed51003cdb9a3","ref":"refs/heads/master","pushedAt":"2024-04-24T11:59:02.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"1. Paper from Apple\n2. Extension of the earlier MLLM from Apple i.e. Ferret\n3. Focuses on the Problems with the existing VLMs/MLLMs, even after a certain number of improvements were proposed in the past to resolve these.\n4. Why grounding and referring are necessary but not sufficient to improve the performance of existing MLLMs, including the Ferret v1\n5. The importance of image resolution for tasks where a fine-grained understanding of the objects is required.\n6. How to mix local and global features from two different visual encoders?","shortMessageHtmlLink":"1. Paper from Apple"}},{"before":"02340ae7fdb1c84740518ea464a8ad533b07b23c","after":"dd48102e80594ac50a64c0e123c4dbbafbb147eb","ref":"refs/heads/master","pushedAt":"2024-03-19T10:51:01.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"update MM1 paper from Apple","shortMessageHtmlLink":"update MM1 paper from Apple"}},{"before":"18059ac48001a0084715ef3f48ac908706d94513","after":"02340ae7fdb1c84740518ea464a8ad533b07b23c","ref":"refs/heads/master","pushedAt":"2024-03-19T10:47:01.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"1. Paper from Apple\n2. Discussed the various choices related to datasets, model architectures, and evaluation for pre-training and SFT for MLLMs\n3. Proper ablation of all the above\n4. Creates a new SOTA multimodal MoE style model MM1 with above design chocies","shortMessageHtmlLink":"1. Paper from Apple"}},{"before":"bcdf47d1dc01e4254ad83551318357a65e21e077","after":"18059ac48001a0084715ef3f48ac908706d94513","ref":"refs/heads/master","pushedAt":"2024-02-09T13:41:29.000Z","pushType":"push","commitsCount":5,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add shortened llama paper","shortMessageHtmlLink":"add shortened llama paper"}},{"before":"fc3add4bc95b148a0579bf55923bf42ae7d3fa26","after":"bcdf47d1dc01e4254ad83551318357a65e21e077","ref":"refs/heads/master","pushedAt":"2024-01-10T13:29:04.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add VCoder paper","shortMessageHtmlLink":"add VCoder paper"}},{"before":"dcb1c089bb66a40940f09df08acc6c17077dbe0b","after":"fc3add4bc95b148a0579bf55923bf42ae7d3fa26","ref":"refs/heads/master","pushedAt":"2023-12-04T13:11:57.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"update mobileclip","shortMessageHtmlLink":"update mobileclip"}},{"before":"0ce274f2a55bc8e8f7e1aa2d35a96c741ea88202","after":"dcb1c089bb66a40940f09df08acc6c17077dbe0b","ref":"refs/heads/master","pushedAt":"2023-10-19T16:32:18.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add vit_need_registers paper","shortMessageHtmlLink":"add vit_need_registers paper"}},{"before":"ac6d3042d42ba6df1b37ed342d8f30bb404e6b01","after":"0ce274f2a55bc8e8f7e1aa2d35a96c741ea88202","ref":"refs/heads/master","pushedAt":"2023-10-16T13:28:25.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add vision transformers need registers paper","shortMessageHtmlLink":"add vision transformers need registers paper"}},{"before":"06882a2bea4c511bbd971a137b89e258f56d1350","after":"ac6d3042d42ba6df1b37ed342d8f30bb404e6b01","ref":"refs/heads/master","pushedAt":"2023-09-20T07:43:53.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add index for sigmoid loss for image text pretraining","shortMessageHtmlLink":"add index for sigmoid loss for image text pretraining"}},{"before":"61de3c6a1f25b6e5b6449f7417d18cf4c354c50c","after":"06882a2bea4c511bbd971a137b89e258f56d1350","ref":"refs/heads/master","pushedAt":"2023-09-20T07:35:10.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"format readme","shortMessageHtmlLink":"format readme"}},{"before":"8e1c2011674270fb6dd11d43a1447f7cea5f8f54","after":"61de3c6a1f25b6e5b6449f7417d18cf4c354c50c","ref":"refs/heads/master","pushedAt":"2023-08-22T12:50:01.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"1. Paper from Google Brain Zurich\n2. Proposes sigmoid loss function for language-image pretraining\n3. Discusses the benefits, and efficieny of sigmoid over softmax for the same task\n4. Discusses the findings like the effect of beta2 value, batch composition, etc","shortMessageHtmlLink":"1. Paper from Google Brain Zurich"}},{"before":"ed0ec70ff374716f6ca759aeac39fadfef605949","after":"8e1c2011674270fb6dd11d43a1447f7cea5f8f54","ref":"refs/heads/master","pushedAt":"2023-06-21T05:30:53.539Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add Emergent Correspondence from Diffusion Models paper","shortMessageHtmlLink":"add Emergent Correspondence from Diffusion Models paper"}},{"before":"9d6b30cfa67384015c30200a4201fddb63759c48","after":"ed0ec70ff374716f6ca759aeac39fadfef605949","ref":"refs/heads/master","pushedAt":"2023-06-14T13:38:23.107Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"1. Paper from Cornell University\n2. Talks about visual correspondences that is emergent in diffusion models","shortMessageHtmlLink":"1. Paper from Cornell University"}},{"before":"bce8d85829a372f637692e0a7c78ca2cfe44f7ee","after":"9d6b30cfa67384015c30200a4201fddb63759c48","ref":"refs/heads/master","pushedAt":"2023-05-04T12:29:39.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"update Readme","shortMessageHtmlLink":"update Readme"}},{"before":"728456067abba144502b1f191438b1e7ed7baf7a","after":"bce8d85829a372f637692e0a7c78ca2cfe44f7ee","ref":"refs/heads/master","pushedAt":"2023-04-28T14:41:52.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"1. Paper from FAIR\n2. Discusses on how to build a foundational model for segmentation tasks\n3. Discusses the data aspect, engineering aspects, and zero-shot capabilities in detail","shortMessageHtmlLink":"1. Paper from FAIR"}},{"before":"bc835d6eeff26bb431e8d25361a0a3c96e67670b","after":"728456067abba144502b1f191438b1e7ed7baf7a","ref":"refs/heads/master","pushedAt":"2023-03-11T17:05:02.063Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add code for WhisperX","shortMessageHtmlLink":"add code for WhisperX"}},{"before":"0fd03b992bffbd44f1b53faa74653e5556e5abcf","after":"bc835d6eeff26bb431e8d25361a0a3c96e67670b","ref":"refs/heads/master","pushedAt":"2023-03-10T14:31:53.731Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"add WhisperX paper","shortMessageHtmlLink":"add WhisperX paper"}},{"before":"025ef1874cadf22e7bc448b4062f102cf276d427","after":"0fd03b992bffbd44f1b53faa74653e5556e5abcf","ref":"refs/heads/master","pushedAt":"2023-03-10T14:20:17.154Z","pushType":"push","commitsCount":1,"pusher":{"login":"AakashKumarNain","name":"Aakash Kumar Nain","path":"/AakashKumarNain","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/11736571?s=80&v=4"},"commit":{"message":"1. Paper from Visual Geometry Group\n2. Explains why Whisper isn't enough for long-from audio transcription\n3. Proposes three additional modules to address the same: VAD, cut-and-merge, phoneme detection model\n4. Simple and scalable","shortMessageHtmlLink":"1. Paper from Visual Geometry Group"}}],"hasNextPage":false,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAET_HGugA","startCursor":null,"endCursor":null}},"title":"Activity · AakashKumarNain/annotated_research_papers"}